Data Mining


Discover knowledge hidden in your databases




Data mining means searching for knowledge hidden somewhere in gigabyte-scale databases. Knowledge is something more than mere information: it implies structure, i.e. specific correlations, statistical rules or other dependencies that can be expressed in mathematical terms or in natural language. Such patterns are, of course, not easy to find - sometimes one is not even aware that they exist.

On the other hand, such patterns can be worth many millions of dollars if, for example, they describe market reactions important for a given industry. Sometimes capturing them means being able to predict the future and, obviously, to gain an advantage over competitors. An obvious example is the extrapolation of stock market quotations, but in fact every large company stores various data on the disks of its computers. Depending on the approach, those data can either have purely historical value, or they can serve as source material for interesting market analyses whose total cost is reduced by a significant part, namely the part corresponding to data gathering.

Usually, when extracting information from databases, one knows quite well what one is looking for. Building complex, cross-referenced reports may sometimes be technically complicated, but it is always a well-defined procedure - a report answers a precise question, such as "Display all customers who purchased goods worth over $10,000 during the last month and have not yet paid for them." The most important thing about data mining is that we cannot ask such a precise question. We only wish to know whether there is some hidden knowledge in the database.

Some common applications of data mining:

  • analysis of customer churn at telecom operators
    • goal: to understand the reasons for the phenomenon and to reduce its scale
  • analysis of the profit generated by retail-bank customers with respect to the cost of banking services
    • goal: to create new, profitable banking products
  • analysis of the contents of market baskets
    • goal: to optimize the arrangement of goods in supermarkets
  • searching for correlations between the parameters of a manufacturing process and the final product quality
    • goal: to improve quality
  • investigating the causes of process-line malfunctions
    • goal: to warn as early as possible that damage may occur

Data mining requires dedicated tools that make it possible to notice complex correlations between data stored in large corporate databases. Here our consultants apply products offered by SPSS (mainly Clementine) and Oracle (Darwin, DBMS 9i and higher). For some purposes they also use our own dedicated prediction system, N-Expert, based on neural networks.

The three basic, most frequently used data mining techniques are:

  • Neural networks
  • Decision trees
  • Automatic cluster detection

Neural networks

An artificial neural network is a system that processes data in parallel, the way a human brain does. Although the analogy is (at least so far) rather weak, neural networks exhibit surprisingly many features typical of thinking creatures rather than of traditional silicon computers. The essence of neural networks is that they can be trained, in practice by a lengthy procedure of adjusting a huge number of coefficients, called synaptic weights, which "weight" the processed signals. From a human point of view a neural network is a black box that produces, for example, quite good predictions in its own way. A trained network is a system that reacts to particular input signals in an appropriate way and can thus serve as a model of some phenomenon or manufacturing process, predicting its future behavior.
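
To make the idea of "training by adjusting synaptic weights" more concrete, here is a minimal Python sketch - a toy model with invented data, not the N-Expert system or any production tool - in which a tiny network repeatedly corrects its weights until it can predict a simple noisy curve:

    # A tiny single-hidden-layer network fitted to a toy curve by
    # repeatedly adjusting its synaptic weights (gradient descent).
    # Architecture, data and learning rate are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "process" to be modelled: y = sin(x) plus a little noise.
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X) + 0.05 * rng.standard_normal((200, 1))

    # Synaptic weights of a 1-8-1 network, initialised at random.
    W1, b1 = rng.standard_normal((1, 8)) * 0.5, np.zeros((1, 8))
    W2, b2 = rng.standard_normal((8, 1)) * 0.5, np.zeros((1, 1))

    lr = 0.05
    for step in range(5000):
        # Forward pass: hidden layer with tanh activation, linear output.
        H = np.tanh(X @ W1 + b1)
        pred = H @ W2 + b2

        # Mean-squared error and its gradients (backpropagation).
        err = pred - y
        grad_pred = 2 * err / len(X)
        grad_W2 = H.T @ grad_pred
        grad_b2 = grad_pred.sum(axis=0, keepdims=True)
        grad_H = grad_pred @ W2.T * (1 - H ** 2)
        grad_W1 = X.T @ grad_H
        grad_b1 = grad_H.sum(axis=0, keepdims=True)

        # "Training" = a small correction to every weight.
        W1 -= lr * grad_W1; b1 -= lr * grad_b1
        W2 -= lr * grad_W2; b2 -= lr * grad_b2

    print("final mean-squared error:", float((err ** 2).mean()))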


Decision trees

Decision tree algorithms make it possible to automatically generate analytic sentences describing the data, e.g. "If thermometer 1 measures a temperature higher than 150 degrees Celsius and thermometer 2 measures a temperature higher than 120 degrees Celsius, then damage is probable." This property is very important, since the ability to formulate such sentences about the surrounding world is a necessary (but not sufficient) condition for saying that one "understands" this world.
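
As a rough illustration of how such sentences can arise, the Python sketch below trains a small decision tree on synthetic sensor readings (the thermometer values and the "damage" rule are invented for the example, and the exact thresholds the tree finds depend on the random sample) and prints the rules it has learned:

    # Turning raw measurements into readable "if ... then ..." splits
    # with a decision tree.  Data and failure pattern are synthetic.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(1)

    # Synthetic process data: two thermometer readings per record.
    temp1 = rng.uniform(100, 200, 500)
    temp2 = rng.uniform(80, 160, 500)
    X = np.column_stack([temp1, temp2])

    # Assumed failure pattern: damage when both temperatures run high.
    damage = ((temp1 > 150) & (temp2 > 120)).astype(int)

    tree = DecisionTreeClassifier(max_depth=2).fit(X, damage)

    # export_text prints the learned tree as nested if/then splits
    # on the thermometer readings.
    print(export_text(tree, feature_names=["thermometer_1", "thermometer_2"]))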

Automatic cluster detection

One can view data records as points in a multidimensional space whose dimensions correspond to particular data attributes. If, for example, the data describe some machine, the dimensions can be temperature, pressure, power consumption, etc. It may happen that the data records are spread completely chaotically in such a space. Sometimes, however, they are grouped into a sort of condensations, the so-called clusters, which usually have some important meaning. For example, if there are clusters in a space describing the address and education of people, it means that the two features are related to each other. In two-dimensional spaces clusters may be seen with the naked eye, but in the case of many dimensions it is generally difficult to spot them without special mathematical methods, e.g. the so-called K-means algorithm.
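
A minimal sketch of this idea, assuming synthetic machine measurements and the scikit-learn implementation of K-means, might look as follows; the two operating regimes hidden in the data are invented purely so that the algorithm has clusters to find:

    # Automatic cluster detection with K-means on invented machine data.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)

    # Records = points in a 3-dimensional space:
    # (temperature, pressure, power consumption).
    normal_ops = rng.normal([70.0, 2.0, 15.0], [3.0, 0.2, 1.0], size=(300, 3))
    heavy_load = rng.normal([95.0, 3.5, 24.0], [3.0, 0.2, 1.0], size=(100, 3))
    X = np.vstack([normal_ops, heavy_load])

    # Ask K-means for two condensations ("clusters") in this space.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    print("cluster centres (temperature, pressure, power):")
    print(np.round(kmeans.cluster_centers_, 1))
    print("records per cluster:", np.bincount(kmeans.labels_))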

 
