Dear All,
I have a data mining program that needs to run for days. Is it possible to speed up the training process by clustering several servers with Windows 2003 clustering services? Is it true that clustering two quad-core computers gives performance comparable to the sum of the two (there must be some overhead, I know)? I am actually unfamiliar with the use of clustering. Is it just for making the server farm more reliable, or will the nodes collaborate and speed up the whole training process?
If it does speed things up, is there any limit on the number of nodes in the cluster? What versions of Windows and SQL Server do I need to speed up the data mining training process?
Thanks and regards
Tony Chun Tung Siu
I think your problem can be reduced to clustering the Analysis Services server, because the model is stored in an AS database.
As stated in the SQL Server 2005 Failover Clustering White Paper: "The ability to cluster Analysis Services is a new feature in SQL Server 2005. Before SQL Server 2005, the only way to make Analysis Services more available was to either configure it as read-only in a Network Load Balancing cluster, or create it as part of a standard Windows server cluster as a generic resource".
The clustering solution will not help you with processing a single mining model. It could help if you need to apply a model to a very large data set, assuming you can partition the data set using a few queries and then execute those queries from multiple clients against the cluster: the workload can be shared between the cluster nodes, and the performance will scale out almost linearly.
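As a rough illustration of that partition-and-dispatch idea, here is a minimal Python sketch. It assumes the rows to score carry a contiguous integer key, and `score_partition` is a hypothetical stand-in for one client running a prediction query against the cluster for its key range; here it only counts rows so the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(min_id, max_id, n_parts):
    """Split the inclusive key range [min_id, max_id] into contiguous ranges."""
    total = max_id - min_id + 1
    base, extra = divmod(total, n_parts)
    parts = []
    start = min_id
    for i in range(n_parts):
        # The first `extra` partitions take one additional row each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        parts.append((start, start + size - 1))
        start += size
    return parts


def score_partition(key_range):
    """Hypothetical placeholder for a prediction query over one key range.

    A real client would send a query restricted to this range to the
    cluster; this stub just returns the number of rows it would cover.
    """
    low, high = key_range
    return high - low + 1


def score_all(min_id, max_id, n_clients):
    """Run one scoring task per partition in parallel, results in key order."""
    parts = make_partitions(min_id, max_id, n_clients)
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        return list(pool.map(score_partition, parts))
```

Each partition maps to one independent query, which is what lets the cluster share the scoring work between nodes. None of this helps with training a single model.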
A very unrefined possible workaround, assuming you are training a decision tree model:
- start by building a model on one machine. Use a very large COMPLEXITY_PENALTY factor -- this would give you a very small tree. However, the first split in the tree will be the same as in the actual model you want to use
- create two different views on top of the relational data, one for each branch of the first split.
- train a model on each machine, using only the data from one of the two views
- copy both models onto a single machine
- write some sort of stored procedure to execute predictions on the server, and have the stored procedure logic figure out which model to actually use (or, in your client application, decide based on the input which model to query)
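The workaround above can be sketched in Python as follows. This is only an illustration of the split-and-route logic, not Analysis Services code: `first_split` is a hypothetical rule (an assumed `age` column and threshold) standing in for the first split found by the small tree, and `train_majority_model` is a trivial placeholder for training a real mining model on one machine.

```python
from collections import Counter


def first_split(row):
    """Hypothetical first split, as found by the small, heavily penalized tree."""
    return "left" if row["age"] < 40 else "right"


def train_majority_model(rows):
    """Placeholder for real training: always predicts the most common label.

    Assumes `rows` is non-empty.
    """
    label = Counter(r["label"] for r in rows).most_common(1)[0][0]
    return lambda row: label


def train_per_branch(rows):
    """Partition the data by the first split and train one model per branch.

    In the real setup each branch would be a view over the relational data
    and each model would be trained on a different machine.
    """
    branches = {"left": [], "right": []}
    for r in rows:
        branches[first_split(r)].append(r)
    return {name: train_majority_model(subset) for name, subset in branches.items()}


def predict(models, row):
    """Route the input to the model that was trained on its branch."""
    return models[first_split(row)](row)
```

The routing in `predict` plays the role of the stored procedure (or client-side logic) from the last step: it applies the same first split to the input and queries only the matching model.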