Friday, March 30, 2012

Preferable way to use two servers

I use data from two SQL Servers to make up a web page. What is the preferable
way to fetch the data?
- Make two connection objects and connect to both servers, or
- Fetch the data through one connection by means of a linked server?

mike,
From a security point of view, and probably also for performance, use two
connections from the web server.
--
Mark Allison, SQL Server MVP
http://www.markallison.co.uk
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
mike wrote:
> I use data from two SQL servers to make up a webpage. What is the preferable
> way to fetch the data.
> - Make two connection objects and connect to both servers. or
> - Fetch data through one connection by means of a linked server?
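
As a minimal sketch of the two-connection approach recommended above: the web tier opens one connection per server, runs one query on each, and combines the rows itself. Two in-memory SQLite databases stand in for the two SQL Servers so the example runs anywhere; the table names, columns, and the `fetch_rows` helper are all hypothetical and not from the original thread.

```python
import sqlite3

def fetch_rows(conn, query):
    """Run one query on one connection and return all rows."""
    return conn.execute(query).fetchall()

# Two independent connections, one per server. In-memory SQLite databases
# are stand-ins for the two SQL Servers (an assumption for runnability).
orders_db = sqlite3.connect(":memory:")
customers_db = sqlite3.connect(":memory:")

# Hypothetical sample data on each "server".
orders_db.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 19.5), (2, 7.0), (1, 3.25)]
)
customers_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
customers_db.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")]
)

# Each server only sees its own query; the web tier joins the results.
names = dict(fetch_rows(customers_db, "SELECT id, name FROM customers"))
page_rows = [
    (names[customer_id], total)
    for customer_id, total in fetch_rows(
        orders_db, "SELECT customer_id, total FROM orders ORDER BY total"
    )
]

for name, total in page_rows:
    print(name, total)
```

With this layout neither server needs credentials for the other, which is the security point made in the reply; a linked server would instead route the second query through the first server.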

Pre-Execute Phase

What happens during the pre-execute phase?

Is there anything I can do to optimize its execution?

I have a package that takes data from 2 SQL Server sources, unions them, and writes to a sql table.

The pre-execute takes 20 minutes, and the execution takes 45 seconds.

Thanks

BobP

Are you sure there are no other components in your data flow (Lookups?). Do you have another data flow in the package?

Give us a picture of the metadata involved: how many columns are there, and how many rows get transferred?

Thanks.

|||

This is happening on several packages:

1. The example above takes 2 SQL Server sources, with queries returning ~5,000 rows each, unions them, and writes them to a SQL table, with no lookups. The pre-execute takes 20 minutes.

2. Every once in a while, a job that runs every night will freeze on the pre-execute. There are no errors and the job appears to be still running, but it is not doing anything: no processor activity, etc.

What I am looking for is a "list" of what the pre-execute phase is doing. I have no way of troubleshooting issue #2 above.

Does anyone have any ideas, or is anyone else experiencing this?

Thanks!

BobP

|||

This is starting to become a big problem, as this pre-execute task is now locking up at least once a week.

Does anybody have any similar experiences?

Thanks!

BobP

|||

Well, I have found the answer.

The source query actually executes during the pre-execute phase.

What was happening was that I was getting never-ending CXPACKET waits.

I adjusted my MAXDOP setting and this seems to have fixed it.

Pre-execute Hangs at 50%

Hello,

I did a search and found no answers. I have a simple project: one data flow task reading from one DB2 table and writing to another. My OLE DB source is a SQL command that returns 97,000 rows, and my OLE DB destination's data access mode is "table or view".

When I run my OLE DB source SQL command through the "build query" panel, I get all 97,000 rows back within 90 seconds every time.

I don't know what is so different, since I currently run SQL commands that bring back 2.5 million rows from DB2 with no effort in SSIS. I even took my data flow task and placed it into a known good solution, but it still hangs at 50% forever. I tried attaching to the SSIS server and executing through there; again no luck. Still hung!

Any ideas? Thanks.

On your OLE DB Source component in the data flow, set its property, "ValidateExternalMetadata" to false and see if that helps.

One other thing to try is to set the property of the DB2 connection manager object, "DelayValidation" to true.

|||

Thanks for your reply Phil.

I went ahead and set both properties accordingly and reran. After 16 minutes of running it was still in the pre-execute phase, stuck at 50%.

Just to test, I changed my DB2 SQL to FETCH FIRST ROW ONLY and it still hangs at the pre-execute 50% level. So obviously it's not the number of rows causing the problem.

What I don't understand is that, using the "preview" panel in the OLE DB source, I get the first row in about 6-7 seconds.

Thanks.

|||

Do you have any other connection managers defined in the package? Validation will attempt to validate all connections.

What DB2 driver are you using?

|||

No, that is the only connection I have in my package/solution.

I am using "Native OLE DB\IBM OLE DB Provider for DB2".

|||

Is there any way to see a log of where it's getting hung? Maybe it's my OLE DB destination, since everything is perfect with my source.

Any ideas on how to verify my idea? Thanks.

|||

I went ahead and changed my destination to a temporary SQL 2005 table and now it works fine, so it is my destination. It must be failing on something (connection/rights?) and not telling me why.

|||

It seems MS really dropped the ball on this one.

I'm almost certain that it's failing on a unique_index insert error. Why is it hiding the error message and letting SSIS hang FOREVER? Why does it do the SELECT and INSERT before the "real" execution of my package? This really defeats the purpose!!!

oh well - thanks for your help.

|||

Zach84 wrote:

It seems MS really dropped the ball on this one.

I'm almost certain that it's failing on a unique_index insert error. Why is it hiding the error message and letting SSIS hang FOREVER? Why does it do the SELECT and INSERT before the "real" execution of my package? This really defeats the purpose!!!

oh well - thanks for your help.

Be careful... You aren't using a Microsoft driver. It isn't doing any selects or inserts -- it's just trying to validate metadata and such.

You could always try the Microsoft OLE DB for DB2 driver, which I've had better luck with.

pre-execute failure

I just started getting this error.

[DTS.Pipeline] Error: component "User Type" (377) failed the pre-execute phase and returned error code 0x8007000E.

It wasn't happening before. Does anyone know what it means?

Hi Jim,
Without more information it is probably impossible to say.

What type of component is it?
What are you using it for?
How have you configured it?
When do you get the error - when the package starts or when the data-flow starts?
What inputs does it take?

etc...etc...

Regards
Jamie

|||

Excellent questions.
This is a data flow component. All of its input is from a table. It seems that just rearranging the data flow components in the work flow eliminates the problem. Since it isn't happening any longer, I can't do a better job of answering your questions.

I don't understand it.

|||

Incidentally, that error is Out Of Memory, so there could have been some transient cause. Please let us know if you do come across a repro.
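
The error code in this thread can be read by splitting the HRESULT into its fields. The sketch below is only an illustration; `decode_hresult` is a hypothetical helper name, while the bit layout (severity bit, 11-bit facility, 16-bit code) and the Win32 constants are the standard ones.

```python
def decode_hresult(code):
    """Split a 32-bit HRESULT into (is_failure, facility, error_code)."""
    is_failure = bool((code >> 31) & 1)  # top bit is the severity flag
    facility = (code >> 16) & 0x7FF      # 11-bit facility field
    error_code = code & 0xFFFF           # low 16 bits carry the error code
    return is_failure, facility, error_code

FACILITY_WIN32 = 7       # the code field is a plain Win32 error number
ERROR_OUTOFMEMORY = 14   # Win32 error 14 (0xE)

failed, facility, win32_code = decode_hresult(0x8007000E)
print(failed, facility, win32_code)
```

For 0x8007000E the facility is Win32 and the code is 14, which is why the reply identifies it as an out-of-memory failure rather than a metadata or connection problem.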

Pre-execute error in simple import/export

I am trying to copy data from a SQL Server 2000 DB to a SQL Server 2005 DB using the import/export wizard in SQL Server Management Studio. The two databases are not identical: they have different table names and different columns, but I thought I had set up all the right mappings and had set the 'Enable identity insert' option. When I ran the wizard, it errored at the Pre-execute phase. I simplified the wizard down to one table and only a couple of varchar columns, and this also errored in the same way. The error report is detailed below.

For reference the SQL2000(ent. edn) DB is on a windows 2000 server and the SQL2005(dev. edn) DB and the management studio are both on my WinXPSP2 workstation.

Could somebody explain why these errors have occurred and, more importantly, how to rectify the problem?

Many thanks,
Michael.

Operation stopped...

- Initializing Data Flow Task (Success)

- Initializing Connections (Success)

- Setting SQL Command (Success)

- Setting Source Connection (Success)

- Setting Destination Connection (Success)

- Validating (Warning)

Messages

Warning 0x80047076: Data Flow Task: The output column "DateAdd" (23) on output "OLE DB Source Output" (11) and component "Source - tccNewsArticles" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.
(SQL Server Import and Export Wizard)

Warning 0x80047076: Data Flow Task: The output column "DateChg" (26) on output "OLE DB Source Output" (11) and component "Source - tccNewsArticles" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.
(SQL Server Import and Export Wizard)

... NOTE: I have removed the rest of the warnings as they were the same as above (many of them).

- Pre-execute (Error)

Messages

Error 0xc0202009: Data Flow Task: An OLE DB error has occurred. Error code: 0x80040E21.
An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80040E21 Description: "Multiple-step OLE DB operation generated errors. Check each OLE DB status value, if available. No work was done.".
(SQL Server Import and Export Wizard)

Error 0xc0202025: Data Flow Task: Cannot create an OLE DB accessor. Verify that the column metadata is valid.
(SQL Server Import and Export Wizard)

Error 0xc004701a: Data Flow Task: component "Destination - Nrs_NewsArticles" (112) failed the pre-execute phase and returned error code 0xC0202025.
(SQL Server Import and Export Wizard)

- Executing (Success)

- Copying to [NereusV2_1].[dbo].[Nrs_NewsArticles] (Stopped)

- Post-execute (Stopped)

- Cleanup (Success)

Messages

Information 0x4004300b: Data Flow Task: "component "Destination - Nrs_NewsArticles" (112)" wrote 0 rows.
(SQL Server Import and Export Wizard)

Michael,

This is most likely a truncation problem. Could you check the sizes of the source and destination columns you are using?

Thanks.
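
One way to act on that suggestion is to list the declared lengths of each mapped column pair and flag every destination that is narrower than its source. The sketch below assumes the lengths have already been collected by hand; the column names and sizes are hypothetical and are not taken from the tables in this thread.

```python
from collections import namedtuple

# One entry per column mapping in the wizard (all values are hypothetical).
ColumnMapping = namedtuple(
    "ColumnMapping", "source_name source_len dest_name dest_len"
)

def find_truncation_risks(mappings):
    """Return the mappings whose destination column is narrower than the source."""
    return [m for m in mappings if m.dest_len < m.source_len]

mappings = [
    ColumnMapping("Headline", 200, "Title", 200),
    ColumnMapping("Body", 4000, "Summary", 255),
]

risks = find_truncation_risks(mappings)
for m in risks:
    print(f"{m.source_name} ({m.source_len}) -> {m.dest_name} ({m.dest_len}): may truncate")
```

Any pair flagged this way is a candidate for widening the destination column or casting the source column to the destination length in the source query.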

pre-emptive locking solution

Hey all,

I've got a very long stored proc that runs intensive updates on a particular table. The locks are always escalated from Intent eXclusive to eXclusive. After some reading online, I've decided to implement this (http://support.microsoft.com/default.aspx?scid=kb;en-us;323630#kb2). The idea is to start a transaction with another spid and hold an incompatible lock on that table, so the stored procedure that I'm running isn't able to escalate the lock. The solution works, but it unfortunately means that I have to lock this table with an update lock for the whole stored proc, which I would rather not do.

Is it possible to spawn another stored proc/function/transaction under another spid from within my stored proc? I'm hoping the answer to my question isn't here (http://www.dbforums.com/t994076.html).

Is it maybe possible to open another connection within my stored proc? On a similar note, would it be possible to communicate somehow between connections without using a table?

Thanks,
-Kilka

There are many ways to do this, but Transact-SQL is a bit limited in this area. It can be done, but it is brute force and ugly at best.

Have you investigated DTS? At least in my experience, it handles this kind of processing much better than Transact-SQL can.

-PatP

|||

I've worked with DTS and I know the only way it could potentially help me is if I used scripting (ActiveX or something) to achieve the same thing.

I'm trying to keep everything in a scheduled stored proc. If at all possible, I want to do everything in T-sql for performance reasons. This is something that takes hours to run, so any little performance hit has a big impact.

PredVariance for NN

Hi,

We're building a model using the NN algorithm, and had a question about how the PredVariance value is computed. Our testing data set has ~28K cases, but for some reason when we run our prediction query there are only 10 unique PredVariance values generated. Why doesn't each case with a distinct predicted value have its own PredVariance value?

For example, here are 3 different PredValues that all have the same PredVariance (229985900), and each has a TrueVal of 0:

15307.6681537296
17759.1791905724
1843.85682442577

If you need more specific info please let me know.

Thanks.

Hello

PredictVariance outputs the error variance for the subnetwork used in executing the prediction. This is detected during training for each subnetwork and, for all predictions executed on that respective subnetwork, PredictVariance will return the same value (same applies to PredictStdev).

If your target variable is a continuous one, then there are at most 2 subnetworks built for the variable (one for the value, one for the probability of Missing state, assuming your training data contains missing values).

If your target variable is discrete or discretized, then one subnetwork is trained to predict the probability of each individual state (including Missing).

Assuming that a prediction returns TargetValue1, the associated PredictVariance should return always the same value (the error variance of the TargetValue1 subnetwork)

You mention that there are 10 unique PredVariance values generated -- it seems that your target variable is discrete with at least 10 distinct states, is this correct?

thanks

|||

Thanks Bogdan,

Our target variable is actually continuous. That's what initially prompted the question, as we expected a PredVariance to be computed for all distinct target variable values (of which there are as many as there are cases).

Does that clarify the question?

Thanks.

|||

The error variance is not computed for each distinct target variable value, but for each subnetwork.

Here are the steps for one continuous target:

- partition the training set in two blocks (training and holdout -- the HOLDOUT_PERCENTAGE and SAMPLE_SIZE parameters control the size of the partitions)

- Iteratively train one subnetwork based on the training partition, and estimate the error (at each step) based on the holdout partition

- At the end, compute the error variance of the trained subnetwork over all the cases in the holdout partition

Therefore, there is a single variance value for the whole subnetwork that predicts one continuous target.

At prediction time, Predict(Target) runs the subnetwork for the target and returns the result; PredictVariance simply returns the (pre-computed) variance for the respective subnetwork (mapped from the normalized training space to the original input space). Therefore, the value returned by PredictVariance should always be the same.

You mentioned that there are 10 distinct variance values being returned?

|||

I'll read your feedback more thoroughly this evening, but I wanted to answer your question: yes, there are 10 distinct PredVariance values.

|||

Hi,

It's now clear, based on your response, why we have 10 distinct PredVariance values. We were using 10-fold cross-validation and each fold has a corresponding PredVariance.

Going to follow up with one additional post/question before we close it out. Stay tuned.

Thanks,

mike
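
The behaviour discussed in this thread can be illustrated with a toy model: the error variance is computed once per trained subnetwork on its holdout cases, then returned unchanged for every prediction. This is only a sketch under loud assumptions. The `Subnetwork` class is hypothetical, a fixed linear predictor stands in for a trained neural network, and the ten slopes stand in for the ten separately trained folds; none of this is the Analysis Services implementation.

```python
from statistics import fmean

class Subnetwork:
    """Toy stand-in for one trained subnetwork (a fixed linear predictor)."""

    def __init__(self, slope, holdout):
        self.slope = slope
        # The error variance is computed once, over the holdout cases only.
        errors = [y - self.predict(x) for x, y in holdout]
        mean_error = fmean(errors)
        self.error_variance = fmean((e - mean_error) ** 2 for e in errors)

    def predict(self, x):
        return self.slope * x

    def predict_variance(self, x):
        # Independent of x: the stored holdout variance is returned as-is.
        return self.error_variance

holdout = [(1, 2.5), (2, 3.5), (3, 6.0)]  # hypothetical (input, target) pairs

# One subnetwork: distinct predictions, a single variance value.
net = Subnetwork(2.0, holdout)
values = [net.predict(x) for x in (10, 20, 30)]
variances = {net.predict_variance(x) for x in (10, 20, 30)}

# Ten folds: ten separately trained subnetworks, so ten variance values.
folds = [Subnetwork(2.0 + 0.1 * k, holdout) for k in range(10)]
fold_variances = {fold.error_variance for fold in folds}

print(len(set(values)), len(variances), len(fold_variances))
```

The first two counts show many predicted values sharing one variance; the last shows how scoring cases with ten different fold models yields ten distinct variance values, matching what was observed with 10-fold cross-validation.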

PredictSupport

Hi,

Can anyone explain what this function returns? What does the support value represent?

Thanks,

Dave

PredictSupport(<attribute>, <state>) returns the number of cases in the training set that support the predicted state for this attribute. If the state is not specified, the state with the highest predict probability is used. The general idea is that a high probability prediction with a larger support value may be more reliable.

|||

Ah right, that makes sense.

Thanks!
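
To make the idea of support concrete, here is a toy frequency-based sketch. It is not how any Analysis Services algorithm computes support; `predict_with_support` and the sample data are hypothetical, and the only point is that two predictions can share a probability while resting on different numbers of training cases.

```python
from collections import Counter

def predict_with_support(training_states, state=None):
    """Return (state, probability, support) from raw training-case counts."""
    counts = Counter(training_states)
    if state is None:
        # Default to the most frequent (highest-probability) state.
        state = counts.most_common(1)[0][0]
    support = counts[state]
    probability = support / len(training_states)
    return state, probability, support

small_sample = ["Yes"] * 4 + ["No"] * 1
large_sample = ["Yes"] * 8 + ["No"] * 2

small_result = predict_with_support(small_sample)
large_result = predict_with_support(large_sample)
print(small_result, large_result)
```

Both calls predict "Yes" with the same probability, but the second rests on twice as many supporting cases, which is the sense in which a larger support value may indicate a more reliable prediction.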

PredictProbability with Association Rule model..

I have run into somewhat of a "duh" question. I'm using an association rules model to run a basket analysis, and I'm trying to get the probability of each prediction. I know the query below is wrong, but how do I go about running PredictProbability on each ProductPurchase prediction?

When I run the below DMX query, I get this error message...

Error (Data mining): the dot expression is not allowed in the context at line 5, column 25. Use sub-SELECT instead.

Thanks in advance...

-Young K

SELECT
  t.[AgeGroupName]
  , t.[ChildrenStatusName]
  , (Predict([Basket Analysis AR].[Training Product], 3)) AS [ProductPurchases]
  , (PredictProbability([Basket Analysis AR].[Training Product].[ProductName])) AS [ProductPurchases]
FROM
  [Basket Analysis AR]
PREDICTION JOIN
  OPENQUERY([DM Reports DM],
    'SELECT
      [AgeGroupName]
      , [ChildrenStatusName]
    FROM
      [dbo].[DM.BasketAnalysis.Contact]
    WHERE isTrainingData = 0
    ') AS t
ON
  [Basket Analysis AR].[Age Group Name] = t.[AgeGroupName]
  AND [Basket Analysis AR].[Children Status Name] = t.[ChildrenStatusName]

You can actually get the statistics directly from the Predict function call.

Note the extra flag for the Predict function (INCLUDE_STATISTICS) and the removal of the PredictProbability call. PredictProbability will not work for nested table columns.

SELECT
  t.[AgeGroupName]
  , t.[ChildrenStatusName]
  , (Predict([Basket Analysis AR].[Training Product], INCLUDE_STATISTICS, 3)) AS [ProductPurchases]
FROM
  [Basket Analysis AR]
PREDICTION JOIN
  OPENQUERY([DM Reports DM],
    'SELECT
      [AgeGroupName]
      , [ChildrenStatusName]
    FROM
      [dbo].[DM.BasketAnalysis.Contact]
    WHERE isTrainingData = 0
    ') AS t
ON
  [Basket Analysis AR].[Age Group Name] = t.[AgeGroupName]
  AND [Basket Analysis AR].[Children Status Name] = t.[ChildrenStatusName]

The results will include the probability, support, and adjusted probability for each of the predicted items.

|||

Thanks so much. :-D

Predictions

Hello all,

I was handed a new project and I'm not quite sure how to do it or even what technology to use. I think that SQL Server's Analysis Services could help (or even do the entire thing), but I'm not sure. So, basically, I'm looking for recommendations and resources on the subject.

Here's my scenario: We have a data warehouse that stores information about the production of our company. Around this fact table, we have multiple dimensions, including one about the origin of the production (who) and a time dimension (when). We are building cubes and linking Excel to them for future analysis. Everything is done. Now, the almighty supervisors would like to have some predictions built into the system for the rest of the year, based on the previous years. Basically, we would be adding production values for every possible time (the step is every hour, so the sum of the production coming from a certain origin is set as the fact and linked to a time dimension row for that hour).

As you can see, it's not very difficult, and it seems to me that Analysis Services might already have that functionality, although I couldn't find it (it seems that predictions can only be made by looking at relations found by the data mining algorithms, and these predictions need to be made by a human because they are not really registered as "new" facts or rows but more like new columns).

So, in short, I'd like to have a tool calculate future production for the rest of the year and set a new column (or dimension) to a certain value indicating that this fact row has been predicted and is not certain. Of course, refreshing the data would not create duplicate rows for predictions; newer predictions (probably based on more facts) would replace the last equivalent prediction.

I hope that's clear.

What do you recommend I use? Is it a case where I will need to build my own separate program (that updates the data warehouse), or is it possible to achieve this using data mining?
Thanks a lot,
Skip.

You could use the data mining prediction task in DTS. I have used this before and it seems to work well. The accuracy of your model (and suitability of your data) can be tested directly by applying the prediction task to the source data. This task, as far as I have used it, is suitable for Y/N type questions, e.g. given a male living in London in a certain wage bracket, would they be likely to join our gym? I can't see how this type of prediction tool can be used for numerical data apart from bracketed ranges. If this IS suitable for you, then it's a simple matter to take the predicted data, load it into the fact table, and then do an incremental update of the cube, e.g. in DTS.
Regards,
Paul Ibison

Prediction with many attribute states

I have a large dataset of around 3 million records with accounting data for 2 years. Attributes are transaction amount (cont. / predict), account, cost centre, project, month and a few others. I want to predict any future transaction amount for a certain combination. For example; what will the next salary cost transaction amount in cost centre 123 probably be?

I have tried Decision trees and Neural nets. But the predictions are not good enough even if there should be clear patterns in normal accounting data.

I guess the problem is that many of the input attributes have many states. There are around 500 accounts, 1000 cost centres, 2000 projects, etc., and the decision tree doesn't seem to be able to capture all the business rules in the company. I have tried to group the attribute states based on their average amount, their parent account, etc., but it doesn't seem to solve the problem.

Please post any suggestion you might have how to improve the prediction. I will try them all and post back my findings!

/Erik

You may need to structure your model so that it creates independent models for all scenarios. Also, if things like "Project" only have a few rows per state, there's likely not a lot to learn from them.

To create independent models you need to move one of your attributes to a nested table. For example, if you thought that "accounts" were the most important you would create a model like this

CREATE MINING MODEL CostByAccount
(
  Transaction LONG KEY,
  AccountAmount TABLE
  (
    Account TEXT KEY,
    Amount DOUBLE CONTINUOUS PREDICT_ONLY
  ),
  CostCenter LONG DISCRETE,
  Project LONG DISCRETE,
  Month TEXT DISCRETE,
  ...
) USING Microsoft_Decision_Trees(params)

This will create a different tree for each account based on input only for that account. To create this table in the UI, you will mark the source table as case and nested tables and then add Account as Key of the nested table.

|||

Thanks Jamie,

Seems like a good idea: creating a forest instead of a tree. The result looks as expected when browsing the created tree structures in the model viewer.

1. But is this kind of model supported by the accuracy chart? I can't seem to get it working. I add the case table and the nested table (same table twice), but the drop-down "Predictable column name" is empty.

2. How do I write the predict query? I have used the query builder but it doesn't seem to work.

/Erik

|||

Actually, no, it doesn't work with the accuracy chart, so you would have to create your own accuracy test queries.

For predict, you should be able to do Predict(<Nested Table Name>,3) for example to get the 3 most likely categories. There are also additional tricks you can play, for example to get statistics you can do

Predict(<Nested Table Name>,INCLUDE_STATISTICS)

This will return all possible states with descriptive stats for each state. Since these functions return tables you can select from them, e.g.

SELECT (SELECT * FROM Predict(<Nested Table>, INCLUDE_STATISTICS) WHERE $Probability >0.25) as Result FROM MyModel ...

Will return all states with a 25% probability or higher.

|||

I can't follow you.

This is approximately what I would like to do, but it doesn't work. (It is a simplified version of the real model.)

/Erik

SELECT
t.[TransactionID],
t.[Account],
t.[CostCentre],
t.[Project],
(t.[Amount]) as [ActualAmount],
(SELECT ([Amount]) as [EstimatedAmount] FROM [DesTree].[Transactions])
From
[DesTree]
PREDICTION JOIN
SHAPE {
OPENQUERY([Adb2],
'SELECT DISTINCT
[TransactionID],
[Account],
[CostCentre],
[Project],
[Amount]
FROM
[dbo].[Transactions]
ORDER BY
[TransactionID]')}
APPEND
({OPENQUERY([Adb2],
'SELECT
[Account],
[Amount],
[TransactionID]
FROM
[dbo].[Transactions]
ORDER BY
[TransactionID]')}
RELATE
[TransactionID] TO [TransactionID])
AS
[Transactions] AS t
ON
[DesTree].[Cost Centre] = t.[CostCentre] AND
[DesTree].[Project] = t.[Project] AND
[DesTree].[Transactions].[Account] = t.[Transactions].[Account] AND
[DesTree].[Transactions].[Amount] = t.[Transactions].[Amount]

|||

I think you want to do your nested select like this

SELECT FLATTENED

t.[TransactionID],
t.[Account],
t.[CostCentre],
t.[Project],
(t.[Amount]) as [ActualAmount],

(SELECT Account, Amount FROM Predict(Transactions) WHERE Account='MyAccount') as Prediction

FROM ...

The only problem here is that you can't compare the nested account to your input, only to a static string or parameter. E.g. you can do WHERE Account=@Account, but you can't do WHERE Account=t.Account.

Prediction Query in MS Association Rules

Hi!

I'm building a mining model with MS Association Rules. After processing this model, the result includes some rules (example):

E = Existing, C = Existing -> B = Existing
F = Existing -> E = Existing
C = Existing, B = Existing -> E = Existing
F = Existing -> B = Existing
B = Existing, A = Existing -> C = Existing
F = Existing, B = Existing -> E = Existing
F = Existing, E = Existing -> B = Existing
D = Existing -> A = Existing
C = Existing -> A = Existing
E = Existing, A = Existing -> B = Existing

I want to build a query that uses two or more items on the left side of a rule, for example: E = Existing, C = Existing -> B = Existing
That is, I want to build a query that predicts: when a customer buys 'E' and 'C', he is likely to buy 'B'.


All the rules are used when you use AR for prediction. The first place to look is the prediction query builder. There is a button on the top to switch the mode from batch to singleton. With a singleton prediction you can manually specify the inputs for your query.

The prediction function you need to specify is something like "Predict(<my nested table name>, 5)". To build such a prediction in the query builder, select Prediction Function, then Predict, then type the name of your nested table, comma, then the number of recommendations you want into the parameters box.

To see the query select the SQL mode from the toolbar.

Let me know if this helps or if you were looking for some other type of answer

THanks

-Jamie

|||

Hi!

Thanks for your interest in my question!

My domain has two tables: Customer (Customer_ID, Name, ....) and Purchase (Customer_ID, Product_Name, Quantity,...)

Creating Mining Model:

CREATE MINING MODEL ProductPredict (
  Customer_ID LONG KEY,
  Purchase TABLE PREDICT (Product_Name TEXT KEY)
)

So, when I build a query such as:

Select Predict(Purchase, 3)

From ProductPredict Prediction Join

(Select 'A' As Product_Name

) as customer

On [ProductPredict].[Purchase].Product_Name = customer.Product_Name;

The result is every item on the right side of the rules that contain 'A' on the left side.

But I want to build a query whose result is every item on the right side of the rules that contain both 'A' and 'B' on the left side.

In summary: how do I build a query that returns the right-side items of the rules that contain several given items on the left side?

|||

You need a query such as

Select Predict(Purchase, 3)

From ProductPredict Prediction Join

(select

(Select 'A' As Product_Name UNION Select 'B' AS Product_Name)

as Products

) as customer

On [ProductPredict].[Purchase].Product_Name = customer.Product_Name;

This will cause rules with A and B to fire. You may still get predictions based on A alone and B alone, though, depending on their probability and lift.

|||

Hi!

Thank you very much! That's interesting, but when I ran that query it returned an error. The query that works is:

Select PredictAssociation(Purchase, 3)

From ProductPredict Prediction Join

(

select

( Select 'A' As Product_Name

UNION

Select 'B' AS Product_Name

)as Products

) as customer

On [ProductPredict].[Purchase].Product_Name = [customer].[Products].Product_Name

|||

Predict is a polymorphic function: DMX maps it to the appropriate function based on the model that's being queried. In this case, it maps to PredictAssociation, so you should get the same results either way. What errors did you see with the earlier query?
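
The way a multi-item input makes rules fire can be sketched outside DMX. The example below encodes the ten rules listed at the top of this thread and returns the right-hand items of every rule whose whole left side is contained in the basket. It is a simplified illustration only: the function names are hypothetical, and real association-rule prediction also ranks results by probability and lift, which this sketch ignores.

```python
# The ten rules from the question, as (left-hand items, right-hand item).
RULES = [
    ({"E", "C"}, "B"),
    ({"F"}, "E"),
    ({"C", "B"}, "E"),
    ({"F"}, "B"),
    ({"B", "A"}, "C"),
    ({"F", "B"}, "E"),
    ({"F", "E"}, "B"),
    ({"D"}, "A"),
    ({"C"}, "A"),
    ({"E", "A"}, "B"),
]

def fired_rules(basket):
    """Rules whose whole left side is in the basket and whose right side is new."""
    return [
        (lhs, rhs) for lhs, rhs in RULES
        if lhs <= basket and rhs not in basket
    ]

def recommendations(basket):
    """Distinct right-hand items of all fired rules, in alphabetical order."""
    return sorted({rhs for _, rhs in fired_rules(basket)})

basket = {"E", "C"}
print(recommendations(basket))
```

For the basket of 'E' and 'C', the two-item rule produces 'B', and the single-item rule C -> A also fires. That mirrors the remark earlier in the thread that supplying several items still allows predictions based on each item alone.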

Prediction Query for a "weighted" clustering model

I have a question about writing a prediction query against a clustering model that has the same column added more than once.

Per Jamie, I can accomplish some crude weighting by adding a column to my model multiple times. See this post for an explanation. Now that I have that worked out, I was wondering how my DM query would look. If I have Input_A1, Input_A2, and Input_A3 all being sourced from the same column in my structure, do I have to reference all three when writing my prediction query?

To be most theoretically accurate, yes. However, I would check to see how the results change for your particular model as you change inputs. If you don't have any missing data, it may not make a significant difference.

prediction on multi columns

I have a mining model with 20 columns: 10 columns are for data (A1, A2, ..., A10)
and 10 columns are for prediction (B1, B2, ..., B10). The data is not in a nested table, just one table,
and the model uses Association Rules.
A1 text
A2 text
...
A10 text

B1 text prediction only
B2 text prediction only
...
B10 text prediction only

I have rules of the form Ai --> Bj.

I want to write a statement that predicts Bj values when I have Ai values, with the Ai values coming from a text box on screen. Can you show me some examples?

Thanks

You can use PredictAssociation() in a DMX statement to get the rules. In your example, the statement will look like:

Select

PredictAssociation(Ai, INCLUDE_STATISTICS, n)

FROM

[Model]

NATURAL PREDICTION JOIN

(SELECT Value as Ai) AS T

Where Ai is replaced with the specific A column you're providing as input and the Value is the Value of the Ai column.

Hope this helps

Prediction Join to MDX with nested table

If your prediction join is to a SQL datasource, you can easily write a SQL query which returns a nested table like:

SELECT
Predict([Subcategories],2) as [Subcategories]
FROM
[SubcategoryAssociations]
NATURAL PREDICTION JOIN
(SELECT
(SELECT 'Road Bikes' AS Subcategory
UNION SELECT 'Jerseys' AS Subcategory
) AS Subcategories
) AS t

What about if your datasource is a cube? Is there some special MDX syntax similar to the SQL syntax above? Or do you have to utilize the SHAPE/APPEND syntax as follows?

SELECT t.*, $Cluster as ClusterName
FROM [MyModel]
PREDICTION JOIN
SHAPE {
select [Measures].[My Measure] on 0,
[My Dimension].[My Attribute].[My Attribute].Members on 1
from MyCube
}
APPEND (
{
select [Measures].[Another Measure] on 0,
NON EMPTY [My Dimension].[My Attribute].[My Attribute].Members
*[Product].[Product].[Product].Members on 1
from MyCube
}
RELATE [[My Dimension]].[My Attribute]].[My Attribute]].[MEMBER_CAPTION]]]
TO [[My Dimension]].[My Attribute]].[My Attribute]].[MEMBER_CAPTION]]]
)
AS [My Nested Table] AS t
ON [MyModel].[Product].[Product] = t.[My Nested Table].[[Product]].[Product]].[Product]].[MEMBER_CAPTION]]]

Typically, for building models on top of cubes, it is much easier to use the tools (BI Dev Studio). This way you can define your model directly on top of the cube and lots of optimizations occur. With such models, you can even use the MDXPredict function to get prediction results inside MDX queries over the source cube.

The DMX SELECT statement supports as input rowset-returning Analysis Services statements (MDX or DMX). That means that dataset-returning statements are not supported. But many MDX queries can be flattened. Have you tried something like SELECT FLATTENED in the MDX query?

|||

Bogdan-

Thanks for the reply. Yes, BIDS worked great for building the model. I've got it trained. Now I want to do a prediction based upon data from a cube. From what I can tell, you can't do prediction queries off a cube using BIDS because it only lets you predict off a relational table source. Right?

I've been researching the MDX function "Predict" which you mentioned. But I'm having terrible trouble finding example queries using that function...

Here's what I'm looking for... we've built a clustering model to cluster our stores. Some of the attributes are just Store dimension attributes... some are from a nested table (stats about the sales volume from each product category). We trained the model with all the stores. Now we want to extract the cluster name for each store and save that to a table. So is there a straight MDX query using the Predict MDX function which will get me the cluster name for every store? I was having trouble seeing how the Predict MDX function was able to know how to do a prediction join to the Store dimension.

As a side note, we could almost do a natural prediction join back to (select * from Model.CASES) except that we don't want the Store Key to influence the clustering model so we didn't add that as an input to the model. (And marking Store Key as Ignore excludes it from the Model.CASES resultset.)

By the way, we're only talking about a couple hundred rows, so the performance of the SHAPE/APPEND syntax below is fine for my purposes... just seeing if there's a more elegant way to do it.

Thanks!

|||

Oh... and to answer your other question about trying "SELECT FLATTENED"...

It's my understanding that "SELECT FLATTENED" is DMX. I'm not sure how to write an MDX statement that starts with "SELECT FLATTENED". And I'm struggling to see how using the DMX "SELECT FLATTENED" would help me. The output of the DMX prediction query I used in the examples at the beginning of the thread works fine. I suppose I could flatten the output, but that wouldn't help me much. It's the input to the prediction join that I'm concerned with.

Or did you mean that you can use an MDX query which is written to be flat and use that as input to a prediction query which expects nested tables? I just tried that but may not have been using the right syntax because I couldn't get it to work. Suggestions?

|||

You kind of need to do it by brute force: we use the flattening semantics of MDX when executing the query, so you have to reshape using SHAPE.

There is a little trick to help you out in building the queries. You can use DMX to examine the flattened structure of the MDX query. Just issue a query like this:

SELECT t.* FROM AnyModel NATURAL PREDICTION JOIN <My MDX Query> AS t

then you will be able to see how the DMX processor sees your MDX results.

|||

Jamie-

That trick is helpful for seeing how it refers to the results of an MDX query.

But how do I take a flat MDX query and shape it so it can be consumed by a prediction join which expects a nested table. See the MDX example at the top of this post. Is that the only way (tying two separate MDX queries together with SHAPE/APPEND)?

|||

Yes your original SHAPE/APPEND would be the way to go.

The implementation of SHAPE in the AS engine will cause the MDX query results to be automatically returned in a flattened manner without requiring any explicit flattening syntax in the query itself (in fact, there is no such syntax - flattening is requested as either a command property in XMLA or by requesting a rowset interface in OLE DB).

prediction in ms sql server 2005

Hey

Does anyone know if the following is possible:

I want to add a column to a table that contains the predicted value according to a decision tree mining model. (I know that this is possible). But now I would like that when a new row is added to this table, and every column except the prediction column is filled in manually, can ms sql server add the predicted value automatically for this row?
I know it is possible to execute a Singleton query for this kind of single prediction, but I would like to integrate this in my data table, because for now my steps would be:
- Create the table with one prediction column
- Add the known values of all columns for one row
- Use singleton query in Mining model prediction tab to know the predicted value
- Fill in the predicted value manually in my table.

I hope my question is clear.

Thanks in advance for the help.

Smileyke

You could probably do this with an INSERT trigger on your SQL Server database table that makes a singleton prediction query via a linked server to the AS server that holds your mining model.|||And what would this query look like?
I mean, you put me in the right direction I think, but I can't make it work.

Thnx|||Please see this article I just posted for details on how to do this: http://www.sqlserverdatamining.com/DMCommunity/TipsNTricks/3914.aspx|||Ok, I tried this, and it was very helpful, but it still doesn't work here.

Do you have any idea why the first query here works, but the second one doesn't? The error is given at the end:

1st working query:
SELECT * FROM OPENQUERY(DMServer,
'SELECT Rings from [Abalone Training Half]')

2nd not working query:
SELECT * FROM OPENQUERY(DMServer,
'SELECT Rings FROM [Abalone Training Half]
NATURAL PREDICTION JOIN
(SELECT I AS Sex,
12 AS Length,
12 AS Diameter,
12 AS Height)
AS T')

The error is:
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "MSOLAP" for linked server "DMServer" reported an error. The provider did not give any information about the error.
Msg 7320, Level 16, State 2, Line 1
Cannot execute the query "SELECT Rings FROM [Abalone Training Half]
NATURAL PREDICTION JOIN
(SELECT I AS Sex,
12 AS Length,
12 AS Diameter,
12 AS Height)
AS T" against OLE DB provider "MSOLAP" for linked server "DMServer".

As you can see, the predicted column here is Rings, and the input attributes are Sex, Length, Diameter and Height.

I really hope you can still help me.

Smileyke|||

I may be wrong, but the Sex column seems TEXT. In this case, shouldn't "SELECT I AS Sex" be actually "SELECT 'I' AS Sex" ?

In this case, your OPENQUERY should look like below (2 single quotes around I )

SELECT * FROM OPENQUERY(DMServer,
'SELECT Rings FROM [Abalone Training Half]
NATURAL PREDICTION JOIN
(SELECT ''I'' AS Sex,
12 AS Length,
12 AS Diameter,
12 AS Height)
AS T')
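
The quote doubling above is easy to get wrong when the DMX text is assembled in application code. As a minimal sketch (the helper name `wrap_in_openquery` is hypothetical, not part of any SQL Server library, and it assumes the linked server name is a trusted identifier), the escaping step could be automated like this:

```python
def wrap_in_openquery(linked_server: str, dmx: str) -> str:
    """Embed a DMX statement in a T-SQL OPENQUERY call.

    OPENQUERY receives the inner query as a T-SQL string literal, so
    every single quote inside the DMX text has to be doubled.
    The linked server name is inserted as-is (assumed trusted).
    """
    escaped = dmx.replace("'", "''")
    return f"SELECT * FROM OPENQUERY({linked_server}, '{escaped}')"


# The singleton query from the thread, written with ordinary DMX quoting.
dmx = (
    "SELECT Rings FROM [Abalone Training Half] "
    "NATURAL PREDICTION JOIN "
    "(SELECT 'I' AS Sex, 12 AS Length, 12 AS Diameter, 12 AS Height) AS T"
)
tsql = wrap_in_openquery("DMServer", dmx)
print(tsql)
```

The printed statement contains `''I'' AS Sex`, matching the corrected query above.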

|||Thank you. That was indeed the problem.

Now the complete trigger works, so thank you all.

smileyke

Prediction Accuracy

hi,

I am using the time series algorithm and need the standard deviation as a percentage. I am using SELECT StudID, PREDICTSTDEV([Perf]) FROM [Stud_Model]. This gives me the standard deviation like this:

StudID stDev
001 2.891298978779
002 2.797288978779

But I need it like this:

StudID stDev
001 +50%
002 +51% (from the previous)

Is this possible?

Thanks,

Karthik.

To get the standard deviation as a percentage, you just need to divide it by the predicted value, e.g.

PredictStdev([Perf])/Predict([Perf]) // of course Predict([Perf]) could be 0.

I'm not sure what you meant by "From the Previous", though.
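
If the division is done on the client after fetching both values, the zero case mentioned above needs a guard. A minimal sketch, assuming the predicted value and its standard deviation have already been read from the query result (the function name and the sample rows are hypothetical):

```python
from typing import Optional


def stdev_as_percent(predicted: float, stdev: float) -> Optional[float]:
    """Return stdev as a percentage of the predicted value.

    Returns None when the prediction is 0, because the ratio is
    undefined there.
    """
    if predicted == 0:
        return None
    return (100.0 * stdev) / predicted


# Hypothetical (StudID, predicted Perf, predicted stdev) rows.
rows = [("001", 80.0, 4.0), ("002", 0.0, 2.5)]
for stud_id, predicted, stdev in rows:
    pct = stdev_as_percent(predicted, stdev)
    print(stud_id, "n/a" if pct is None else f"{pct:.1f}%")
```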

Prediction Accuracy

Hi ,

I am a novice Data Mining Programmer.

I am using Time series algorithm for forecasting.

We are Quite concerned about the accuracy of Prediction output.

For Example Our Data is like this

StudId Date Perf
001 01/01/2001 90
001 02/01/2001 89
001 03/01/2001 87
002 01/01/2001 59
002 02/01/2001 70
003 03/01/2001 47

If I write my prediction query to predict the 100th time step, it gives me output like this:

Date Perf
03/01/2015 47.000000115

We are not sure about the accuracy of the values. Is it possible to use trend information as input to my model and make my prediction based on that?

I don’t know how to do that. Can anyone help?

Thanks,

Karthik.

The time series algorithm in SQL Server 2005 (ARTxp) is designed for near-term prediction accuracy, not far-term predictions such as 100 steps ahead. You can get details on the research behind the algorithm at http://research.microsoft.com/~dmax/publications/dmart-final.pdf

Predicting player win over a period of time

I would like to create a simple regression equation to predict player win on their next trip. I have tried to create the model using a linear regression tree based on two players (as a test). The result gives me a single node (expected) with only a coefficient instead of a regression equation. I can do this math by hand to get a regression equation and predicted value for the next trip for each player.

The dataset I used for a simple test is.....

Trip # Player Win
1 1001 1,250
1 1002 50
2 1001 1,450
2 1002 75
3 1001 1,600
3 1002 100
4 1001 2,000
4 1002 175

I also tried to predict next trip worth using a forecasting model. I was able to process the model but I was not able to browse the model content in the viewer.

Ultimately, I want to predict next trip worth for individual players off of a cube. The cube has about 1.5- 6M records (multiple records per player) depending on the datasource.

FYI - I have created a working linear regression and a forecasting model off of a cube, so I think I am setting it up correctly.

Can you describe how the mining structure/model columns for the linear regression are set up? More specifically,

1. The datatype for each column in the structure

2. The attribute (Predict, PredictOnly, Input, Ignore) set on each of the model columns

Thanks

Shuvro

|||

The variables are all set with a numeric datatype in the table.

The attribute for each column:

actual -- continuous, predict

trip -- continuous, input

win -- continuous, predict

player account number -- discrete, key

Basically, I want to predict worth at the player level (return a regression equation for each player) over a large list of players. I am not trying to get one regression equation for the whole universe of players.

In the time series model, is there a limit to the number of unique cases that can be input into the model?

Thanks so much for your help.

|||

You can have pretty much any number of series for a time series model, but I'm not sure it will help you in this circumstance.

You likely don't want to cross-predict between players. For example, you might start with a model such as

CREATE MINING MODEL PlayerModel
(
Player TEXT KEY,
Trip LONG KEY TIME,
Win LONG CONTINUOUS PREDICT,
Actual LONG CONTINUOUS PREDICT
) using Microsoft_Time_Series

This would create a time series model that contained models for each player for Win and Actual. However, since they are marked "Predict" rather than "Predict Only" Player A's "Win" values could influence Player B's "Actual" values if the data happened to line up that way.

You would think that you could simply make a model like this

CREATE MINING MODEL PlayerModel
(
Player TEXT KEY,
Trip LONG KEY TIME,
Win LONG CONTINUOUS PREDICT ONLY,
Actual LONG CONTINUOUS PREDICT ONLY
) using Microsoft_Time_Series

This causes each time series to be based only on its own values, and not on those of others, so cross-prediction is not an issue. However, you probably want Actual to influence Win for an individual player, just not for other players. The modeling scheme for this situation is to make a separate model for each player.

Similar things would happen if you tried other regression models such as decision trees, logistic, or linear regression. Your model structure would look like

CREATE MINING MODEL PlayerModel
(
Trip LONG KEY,
Players TABLE
(
Player TEXT KEY,
Actual LONG CONTINUOUS PREDICT ONLY,
Win LONG CONTINUOUS PREDICT ONLY
)
) using Microsoft_Linear_Regression

This model wouldn't do anything (it would probably return an error) because there simply are no inputs whatsoever. Again, if you changed a column to Predict (or just left it as Input), the values for different players could influence the regressions for other players. Again, the solution is to create a unique model per player.
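
To illustrate what "one model per player" means numerically, here is a minimal sketch that fits a separate least-squares line per player on the test dataset from the question. It is plain Python rather than Analysis Services, so it only mirrors the hand calculation the original poster described; the function names are hypothetical.

```python
from collections import defaultdict

# (trip, player, win) rows from the test dataset in the question.
ROWS = [
    (1, 1001, 1250), (1, 1002, 50),
    (2, 1001, 1450), (2, 1002, 75),
    (3, 1001, 1600), (3, 1002, 100),
    (4, 1001, 2000), (4, 1002, 175),
]


def fit_line(points):
    """Ordinary least squares for win = intercept + slope * trip.

    Needs at least two distinct trip numbers, otherwise sxx is 0.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_player(rows):
    """Group rows by player and fit one independent line per player."""
    by_player = defaultdict(list)
    for trip, player, win in rows:
        by_player[player].append((trip, win))
    return {player: fit_line(pts) for player, pts in by_player.items()}


models = fit_per_player(ROWS)
next_trip = 5
for player, (intercept, slope) in sorted(models.items()):
    print(player, intercept, slope, intercept + slope * next_trip)
```

Because each player's points are fitted in isolation, one player's wins cannot influence another player's equation, which is the property the reply above is after.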

Predicting in Trees

Hi! I have created a DMM using Trees. But when I go to the Mining Model Prediction tab and select a Predict function, I get this in the criteria column: <Scalar column reference>[, EXCLUDE_NULL|INCLUDE_NULL][, INCLUDE_NODE_ID]. When I select Result, I get this error: "An incorrect number of arguments are used in the function at line 3, column 3." I'm predicting a continuous variable.

But when I delete everything except <Scalar column reference> I get this error: "Parser: The syntax for '<' is incorrect."

When I delete everything in the criteria column, I get this: "Query execution failed."

If I change the criteria to "<Scalar column reference>,INCLUDE_NULL, INCLUDE_NODE_ID" I get the error again that the query execution failed.

I'm working from a data set I created. I had no problems with predictions using clustering, but can't seem to get Trees to work.

Hello,

<Scalar column reference> is supposed to be a placeholder for the actual column name. For example, if you are building a Decision Tree model to predict, say the [Bike Buyer] column (the example in the sample database coming with SQL Server 2005), the function call may look like: Predict( [Bike buyer]) or Predict( [Bike buyer], EXCLUDE_NULL).

Hope this helps

|||Very helpful! Thanks!

Predicting from a clustering model

Hi,

I have built a Clustering model that captures customer demographic information and identifies various hidden clusters based on the information.

What kind of predictions can I make using the above model?

You can do multiple types of queries with a clustering model:

Get the most likely cluster for each case (or a new singleton case) - SELECT Cluster() FROM [model]|||

Thanks Raman,

The 1st, 2nd and the 4th points are clear, but I have certain doubts regarding the 3rd point you mentioned.

Firstly, as the attributes are customer characteristics like age, gender and other demographic information, and we suppose that the values for these always exist for all the cases (new/old), what will be the point of making them predictable and predicting their values again?

Secondly, if I want to predict an attribute like MovieBuyer (which determines whether a customer has bought a movie or not) for new cases, wouldn’t it be better to use the ‘decision trees’ or ‘neural networks’ algorithm rather than clustering?

Basically, I cannot imagine a scenario in which clustering stands out as the best-suited algorithm for prediction.

It seems that clustering is best suited for exploring and understanding the data rather than for prediction.

I hope I have expressed my problem adequately and clearly.

Any help would be most appreciated.

|||

Your assessment is correct in that clustering is more suitable for data exploration/understanding - I was just pointing out that it is *possible* to do prediction with the SQL Server DM Clustering algorithm (which was your original question).

|||Thanks Raman

PredictCaseLikelihood

I'm working with the cluster analysis algorithm (EM) in SQL 2005. I have tried to find documentation on the function PredictCaseLikelihood without luck. Is there any reference on how this function is defined?

Here's an excerpt from my book Data Mining with SQL Server 2005

PredictCaseLikelihood

PredictCaseLikelihood returns a measure from 0 to 1 that indicates how likely an input case is to exist considering the model learned by the algorithm. This measure is very good for use in anomaly detection as it quickly and easily tells you if new data is similar to any data seen before. This function operates in two modes, normalized and nonnormalized.

In the nonnormalized mode, the value of the measure is the raw probability of the case, that is, the product of the probabilities of each of the attributes in the case. For instance, if the probability of Home Ownership = ‘Yes’ is 40% and the probability of Occupation = ‘Craftsmen’ is 10% then the probability of the case is 40% x 10% = 4%.

Nonnormalized likelihoods can be useful, but due to the nature of the probabilities, as you increase the number of attributes in a case the probability of the case becomes increasingly small. Additionally, as a user, you cannot tell whether a 4% probability for a certain combination of attributes is a good thing or a bad thing. The normalized likelihood divides the probability of the case as provided by the model by the probability computed without the model using raw statistics. This provides a “lift” number that is normalized between 0 and 1 using the formula (lift)/(lift + 1). This is interpreted to mean that cases with likelihood values greater than 0.5 have positive lift and are more likely than random to occur, and that values less than 0.5 have negative lift and are less likely than random to occur.

For continuous attributes, the probability distribution is used for this computation.
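
The arithmetic described above can be sketched in a few lines. This is only an illustration of the two formulas in the excerpt (product of attribute probabilities, then lift / (lift + 1)), not the server's implementation; the function names and the model probability of 8% are hypothetical.

```python
from math import prod


def raw_case_probability(attribute_probabilities):
    """Nonnormalized likelihood: product of the per-attribute probabilities."""
    return prod(attribute_probabilities)


def normalized_likelihood(model_probability, baseline_probability):
    """Map lift = model / baseline into the range 0..1 via lift / (lift + 1)."""
    lift = model_probability / baseline_probability
    return lift / (lift + 1.0)


# Example from the excerpt: 40% x 10% gives a raw probability of about 4%.
baseline = raw_case_probability([0.40, 0.10])

# Hypothetical: the model rates the same case at 8%, twice the baseline,
# so the lift is about 2 and the normalized value is about 2/3.
print(normalized_likelihood(0.08, baseline))

# A model probability equal to the baseline gives lift 1, i.e. exactly 0.5.
print(normalized_likelihood(baseline, baseline))
```

Values above 0.5 correspond to lift greater than 1, which is why 0.5 is the dividing line in the excerpt.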

This query returns the normalized case likelihood for each case in the input set.

SELECT t.id, PredictCaseLikelihood()
FROM CustomerClusters
NATURAL PREDICTION JOIN <Input Set> AS t

This query returns the nonnormalized case likelihood for each case in the input set.

SELECT t.id, PredictCaseLikelihood(NONNORMALIZED)
FROM CustomerClusters
NATURAL PREDICTION JOIN <Input Set> AS t

predict products ( data mining 2000)

I want to make a web page that, when somebody visits, shows the products that people often buy at that time (a given month, or summer).

How do I use data mining to predict those products?

Also, I want to know what percentage of buyers like each product,

or to show the products in descending order of that percentage.

You can use Microsoft Association Rules to solve this problem: you can include the time info (such as month or summer) in each transaction, and our Association Rules algorithm will do the counting and find any rules that apply. Please check the live sample http://www.sqlserverdatamining.com/DMCommunity/LiveSamples/54.aspx about training Association Rules. You might also want to check out the tips and tricks on our data mining web site http://www.sqlserverdatamining.com. SQL Server Books Online is another resource you can use to learn about SQL Server in general and SQL Server Data Mining.

Besides time info, you can also include basket info (other products the user has already chosen) and the user's demographic info (such as Gender, Income, if you have that info) in your model.

DMX is the language you can use to do prediction. The function PredictHistogram(Product) will give the list of products and their probabilities when you run queries against your model.

Good luck,

|||

For SQL Server 2000 you would use Microsoft_Decision_Trees. You will build a model based on your customer/shopping basket table using the wizard. Since 2000 doesn't support creating nested tables in the wizard, you then need to add the nested table from your transaction table in the data mining editor.

Your model will look something like this

CREATE MINING MODEL MyModel
(
BasketID LONG KEY,
Season TEXT DISCRETE,
Products TABLE PREDICT
(
ProductName TEXT KEY
)
) USING Microsoft_Decision_Trees

If you have a large number of products (which you likely do) you will have to set the MAXIMUM_INPUT_ATTRIBUTES and MAXIMUM_OUTPUT_ATTRIBUTES on the model as well, otherwise the algorithm will perform feature selection to the default 255.

|||Thanks.

predict product with sex and age of customer.

I have these tables:

customer(customerid, age,sex....)

orderdata(orderid, customerid,day)

orderdetails(orderid, productid, quantity)

products(productid, productname,...)

Now I want to show some products to a customer when I know his age and sex.

E.g.: if he is a man and age = 20, I show products such as ball, pull, sports clothes...; if the customer is a woman, I show lips, babara, t_shirt, skirt...;

if the customer is a child, I show toys, stories for children...

How do I create my mining model, and how do I query for the result in DTS?

The easiest way would be first to create a view containing the customer information along with the order id, e.g.

orderid, age, sex, ...

Then you would create a model using that view as the case table, and adding a nested table for products. The product nested table should have a single column - the product id (or product name if you denormalize) - which will be the key. Make the table "PREDICT" (or input and output) and make the non-key columns (i.e. other than orderid) all inputs.

You will then be able to predict based on age and sex like this

SELECT FLATTENED TopCount(Predict(Products),$AdjustedProbability,5) FROM MyModel
NATURAL PREDICTION JOIN
(SELECT 25 as Age, 'Male' as Gender) AS t

Predict Probability in Decision Trees

Hello,

I installed the bike buyer example and i am learning the DMX language. Now i wrote the following query (using MS decision trees):

SELECT
T.[Last Name],
[Bike Buyer],
PredictProbability(Predict([Bike Buyer])) AS [Probability]
From
[v Target Mail]
PREDICTION JOIN
OPENQUERY
(....... And so on..)

Now the result is surprising to me. In the result table all the probabilities are equal.

Bike Buyer Probability
1 0.99994590500919611
0 0.99994590500919611
0 0.99994590500919611
0 0.99994590500919611
0 0.99994590500919611
1 0.99994590500919611

and so on.

Now I am wondering what PredictProbability means. I thought that PredictProbability meant the probability that the prediction is correct. But all the probabilities are the same even though the input is different. Can somebody tell me what PredictProbability means, or am I using it wrong?

Thanx in advance,

Joris Valkonet

This is an interesting query - I would write it as "PredictProbability([Bike Buyer])" however, the syntax is semantically the same. Another thing to try is PredictNodeId([Bike Buyer]) to see exactly the node used for the prediction and then check that node to see the distribution.

It's possible that PredictProbability(Predict([Bike Buyer])) is exposing a bug and you should use the simpler syntax above.

HTH

-Jamie

|||

Thank you for your reply.
I modified my query to PredictProbability([bike buyer]) and this made no difference. The result is the same.

Then I checked the PredictNodeId([Bike Buyer]) and this is the result:

FirstName  LastName   bike buyer  Expression
Abby       Malhotra   1           000000003
Abby       Prasad     0           000000003
Abby       Rodriguez  0           000000003
Abby       Srini      0           000000003
Abigail    Brown      0           000000003
Abigail    Bryant     1           000000003
Abigail    Davis      0           000000003
Abigail    Flores     0           000000003

The node ID is that of the root node of the tree, and the probability of the root node is indeed the value 0.99994590500919611. So if I understand this correctly, the root node of the decision tree is used for all probability predictions. Is this correct?

(Maybe my query is incorrect, so the whole DMX query is below.)

SELECT
t.[FirstName],
t.[LastName],
[bike buyer],
PredictNodeId([bike buyer]) AS [Predicted NodeId]
From
[v Target Mail]
PREDICTION JOIN
OPENQUERY([Adventure Works DW],
'SELECT
[Gender],
[FirstName],
[MiddleName],
[LastName],
[BirthDate],
[MaritalStatus],
[EmailAddress],
[YearlyIncome],
[TotalChildren],
[NumberChildrenAtHome],
[HouseOwnerFlag],
[NumberCarsOwned],
[AddressLine1],
[AddressLine2],
[Phone]
FROM
[dbo].[ProspectiveBuyer]
') AS t
ON
[v Target Mail].[First Name] = t.[FirstName] AND
[v Target Mail].[Middle Name] = t.[MiddleName] AND
[v Target Mail].[Last Name] = t.[LastName] AND
[v Target Mail].[Birth Date] = t.[BirthDate] AND
[v Target Mail].[Marital Status] = t.[MaritalStatus] AND
[v Target Mail].[Gender] = t.[Gender] AND
[v Target Mail].[Email Address] = t.[EmailAddress] AND
[v Target Mail].[Yearly Income] = t.[YearlyIncome] AND
[v Target Mail].[Total Children] = t.[TotalChildren] AND
[v Target Mail].[Number Children At Home] = t.[NumberChildrenAtHome] AND
[v Target Mail].[House Owner Flag] = t.[HouseOwnerFlag] AND
[v Target Mail].[Number Cars Owned] = t.[NumberCarsOwned] AND
[v Target Mail].[Address Line1] = t.[AddressLine1] AND
[v Target Mail].[Address Line2] = t.[AddressLine2] AND
[v Target Mail].[Phone] = t.[Phone]

|||

Something weird is happening here. Predict([Bike Buyer]), which is equivalent to saying just [Bike Buyer], returns the state with the highest probability. That means that whatever node the tree goes to, it will return the highest-probability state for that node. I can't see how you are getting different prediction results from the same node.

However, something else struck me as odd. The predict probability of 0.999... doesn't seem right. In fact the only time I have ever seen anything like this is when all the data at the node was missing, and the actual non-missing values only received the Bayesian prior. By default prediction never returns missing (if you are asking for a value, it assumes you want an actual value). If you do a Predict([Bike Buyer], INCLUDE_MISSING) you will get predictions that include the missing value. If there were no observations of the states 1 and 0 in the training data, they would only have their prior probabilities, which would be equal, and the choice of a state would then be arbitrary.

What confuses me is that PredictProbability should behave the same way.

Can you do a SELECT FLATTENED NODE_DISTRIBUTION FROM [v Target Mail].CONTENT and post the results?
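If the full content rowset turns out to be too large to post, a narrower query along these lines should return just the distribution of the node reported by PredictNodeId. This is a sketch rather than something run against this model: the node name '000000003' is taken from the earlier result, and the alias d is arbitrary.

```
-- Sketch: limit the content query to one node and a few distribution columns
SELECT FLATTENED
    NODE_UNIQUE_NAME,
    (SELECT ATTRIBUTE_NAME, ATTRIBUTE_VALUE, [SUPPORT], [PROBABILITY], VALUETYPE
     FROM NODE_DISTRIBUTION) AS d
FROM [v Target Mail].CONTENT
WHERE NODE_UNIQUE_NAME = '000000003'
```

SUPPORT and PROBABILITY are bracketed because they are reserved words in DMX.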

Thanks

|||

Thanks for your analysis. I tried to post the whole result of the query "SELECT FLATTENED NODE_DISTRIBUTION FROM [v Target Mail].CONTENT", but then an unknown error occurred, I think because the result table is too big. So I will give only the top rows of the result.

Thanks....

ATTRIBUTE_NAME           ATTRIBUTE_VALUE  SUPPORT  PROBABILITY  VARIANCE     VALUETYPE
Bike Buyer               Missing          0        0.000432432  0            1
Bike Buyer               0.494048907      18484    0.999567568  0.249964584  3
Birth Date               5.45E-06         0        0            16892175     7
Birth Date               0.080783333      0        0            0            8
Birth Date               22673.85945      0        0            16892175     9
Date First Purchase      -0.000941648     0        0            76769.40019  7
Date First Purchase      3242.190525      0        0            0            8
Date First Purchase      37852.25763      0        0            76769.40019  9
Number Cars Owned        -0.079859089     0        0            1.295870198  7
Number Cars Owned        301.5988757      0        0            0            8
Number Cars Owned        1.502705042      0        0            1.295870198  9
Number Children At Home  -0.015796323     0        0            2.318367003  7
Number Children At Home  9.530872281      0        0            0            8
Number Children At Home  1.004057563      0        0            2.318367003  9
Total Children           -0.003992138     0        0            2.599718694  7
Total Children           2.60E+01         0        0            0            8
Total Children           1.844351872      0        0            2.599718694  9
Yearly Income            2.02E-06         0        0            1042319181   7
Yearly Income            116.9361238      0        0            0            8
Yearly Income            57305.77797      0        0            1042319181   9
(blank)                  36.04143269      0        0            0.166743652  11
Bike Buyer               Missing          0        0.000205761  0.00E+00     1
Bike Buyer               1                4858     0.999794239  0            3
(blank)                  1                0        0            5.14E-07     11
Bike Buyer               Missing          0        0.000536481  0            1
Bike Buyer               0.507518797      1862     0.999463519  0.249943468  3
Date First Purchase      -0.010374704     0        0            1160.841795  7
Date First Purchase      679.5475673      0        0            0            8
Date First Purchase      37822.50644      0        0            1160.841795  9
Geography Key            0.000170418      0        0            36083.71548  7
Geography Key            1.990727272      0        0            0            8
Geography Key            236.4581096      0        0            36083.71548  9
Number Cars Owned        -0.072103841     0        0            1.254812169  7
Number Cars Owned        2.37E+01         0        0            0            8
Number Cars Owned        1.523630505      0        0            1.254812169  9
Yearly Income            1.29E-06         0        0            1131929217   7
Yearly Income            7.102048444      0        0            0            8
Yearly Income            60633.72718      0        0            1131929217   9
(blank)                  392.8960717      0        0            0.113511443  11
Bike Buyer               Missing          0        0.000123274  0            1
Bike Buyer               0.256103576      8110     0.999876726  0.190514534  3

|||

I see the problem: you have Bike Buyer modeled as continuous where it should be discrete. In this case PredictProbability was giving you the probability that Bike Buyer exists (i.e. is not Missing), which is why it is close to 1 for every case. To get the confidence of the prediction for continuous predictions, you want to use PredictStdev.

Essentially what was happening was that the model was creating a linear regression to predict Bike Buyer rather than a classification model. Changing the content type to "Discrete" will fix your issue. If you are creating the model from the source table (rather than a cube), you can click on "Detect" on the data types page of the wizard and it will automatically determine that Bike Buyer should be discrete.
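If the model is created in DMX rather than through the wizard, the equivalent change is to declare the column DISCRETE. The definition below is a sketch with a hypothetical model name and a reduced, assumed column list; it is not the actual definition of [v Target Mail].

```
-- Sketch: hypothetical model with Bike Buyer declared as a discrete target
CREATE MINING MODEL [TM Decision Tree Discrete]
(
    [Customer Key]      LONG   KEY,
    [Yearly Income]     DOUBLE CONTINUOUS,
    [Number Cars Owned] LONG   CONTINUOUS,
    [Bike Buyer]        LONG   DISCRETE PREDICT   -- classification, not regression
)
USING Microsoft_Decision_Trees
```

Once such a model is trained, PredictProbability should report the probability of the predicted state and vary with the inputs, for example:

```
-- Sketch: singleton query against the hypothetical model above
SELECT
    Predict([Bike Buyer])            AS [Predicted Bike Buyer],
    PredictProbability([Bike Buyer]) AS [Probability]
FROM [TM Decision Tree Discrete]
NATURAL PREDICTION JOIN
(SELECT 60000 AS [Yearly Income], 2 AS [Number Cars Owned]) AS t
```

For a column that is deliberately left continuous, PredictStdev([Bike Buyer]) would take the place of PredictProbability in the select list.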