Data Mining Methods. An Introduction to Modern Data Mining: What Data Mining Is

The development of methods for recording and storing data has led to a rapid increase in the volume of collected and analyzed information. The volumes of data are so impressive that it is simply not possible for a person to analyze them on their own, although the need for such an analysis is quite obvious, because this "raw" data contains knowledge that can be used to make decisions. In order to conduct automatic data analysis, Data Mining is used.

Data Mining is the process of discovering previously unknown, non-trivial, practically useful and interpretable knowledge in raw data, knowledge that is needed for making decisions in various areas of human activity. Data Mining is one of the steps of Knowledge Discovery in Databases (KDD).

The information found in the process of applying Data Mining methods must be non-trivial and previously unknown; average sales figures, for example, do not qualify. Knowledge should describe new relationships between properties, predict the values of some features based on others, and so on. The knowledge found should be applicable to new data with some degree of certainty. Its usefulness lies in the fact that this knowledge can bring certain benefits when applied. Knowledge should be in a form that is understandable to a user who is not a mathematician. For example, the logical constructions "if ... then ..." are most easily perceived by a person. Moreover, such rules can be used in various DBMS as SQL queries. When the extracted knowledge is not transparent to the user, there should be post-processing methods that bring it to an interpretable form.

The algorithms used in Data Mining require a lot of calculations. Previously, this was a deterrent to the widespread practical application of Data Mining, but today's performance growth of modern processors has removed the severity of this problem. Now, in a reasonable time, it is possible to conduct a qualitative analysis of hundreds of thousands and millions of records.

Tasks solved by Data Mining methods:

  1. Classification is the assignment of objects (observations, events) to one of previously known classes.
  2. Regression, including forecasting problems: establishing the dependence of a continuous output variable on the input variables.
  3. Clustering is the grouping of objects (observations, events) based on the data (properties) that describe their essence. Objects within a cluster must be "similar" to each other and different from objects in other clusters. The more similar the objects within a cluster and the greater the differences between clusters, the more accurate the clustering.
  4. Association is the identification of patterns between related events. An example of such a pattern is a rule indicating that event Y follows from event X. Such rules are called associative. This problem was first posed for finding typical shopping patterns in supermarkets, so it is sometimes also called market basket analysis.
  5. Sequential patterns: establishing patterns between events related in time, i.e. detecting a dependency of the form "if event X occurs, then event Y will occur within a given time".
  6. Deviation analysis: identification of the most uncharacteristic patterns.

Business analysis problems are formulated differently, but the solution to most of them comes down to one or another Data Mining task or a combination of them. For example, risk assessment is a solution to a regression or classification problem, market segmentation is clustering, demand stimulation is association rules. In fact, Data Mining tasks are the elements from which you can assemble a solution to the vast majority of real business problems.

To solve the above problems, various methods and algorithms of Data Mining are used. In view of the fact that Data Mining has developed and is developing at the intersection of such disciplines as statistics, information theory, machine learning, database theory, it is quite natural that most Data Mining algorithms and methods were developed based on various methods from these disciplines. For example, the k-means clustering procedure was simply borrowed from statistics. The following Data Mining methods have gained great popularity: neural networks, decision trees, clustering algorithms, including scalable ones, algorithms for detecting associative links between events, etc.

Deductor is an analytical platform that includes a complete set of tools for solving Data Mining problems: linear regression, supervised neural networks, unsupervised neural networks, decision trees, search for association rules and many others. For many mechanisms, specialized visualizers are provided that greatly facilitate the use of the resulting model and the interpretation of the results. The strength of the platform is not only the implementation of modern analysis algorithms, but also the ability to arbitrarily combine various analysis mechanisms.

OLAP systems provide the analyst with a means of testing hypotheses when analyzing data; that is, the main task of the analyst is to generate hypotheses, which he solves based on his knowledge and experience. However, knowledge resides not only with the person but also in the accumulated data being analyzed. Such knowledge is contained in a huge amount of information that a person is not able to explore on his own. Because of this, there is a risk of missing hypotheses that could bring significant benefit.

To detect such "hidden" knowledge, special methods of automatic analysis are used, with the help of which knowledge has to be practically extracted from the "blockages" of information. The term "Data Mining" ("data mining") has been assigned to this direction.

There are many definitions of Data Mining that complement each other. Here are some of them.

Data Mining is the process of discovering non-trivial and practically useful patterns in databases. (base group)

Data Mining is the process of extracting, exploring and modeling large amounts of data to discover previously unknown patterns in order to achieve business benefits. (SAS Institute)

Data Mining is a process that aims to discover significant new correlations, patterns and trends by sifting through large amounts of stored data using pattern recognition techniques plus the application of statistical and mathematical methods. (Gartner Group)

Data Mining is the study and discovery by a "machine" (algorithms, artificial intelligence tools) of hidden knowledge in raw data that is previously unknown, non-trivial, practically useful and available for human interpretation. (A. Barseghyan, "Technologies for Data Analysis")

Data Mining is the process of discovering useful knowledge about business. (N.M. Abdikeev, "KBA")

Properties of discoverable knowledge

Consider the properties of the knowledge to be discovered.

  • Knowledge must be new, previously unknown. The effort spent on discovering knowledge that is already known to the user does not pay off. Therefore, it is new, previously unknown knowledge that is of value.
  • Knowledge must be non-trivial. The results of the analysis should reflect non-obvious, unexpected patterns in the data, which constitute so-called hidden knowledge. Results that could be obtained in simpler ways (for example, by visual inspection) do not justify the use of powerful Data Mining methods.
  • Knowledge should be practically useful. The knowledge found should be applicable, including on new data, with a sufficiently high degree of reliability. The usefulness lies in the fact that this knowledge can bring some benefit in its application.
  • Knowledge must be accessible to human understanding. The patterns found must be logically explainable, otherwise there is a possibility that they are random. In addition, the discovered knowledge should be presented in a human-understandable form.

In DataMining, models are used to represent the acquired knowledge. Types of models depend on the methods of their creation. The most common are: rules, decision trees, clusters and mathematical functions.

Data Mining Tasks

Recall that Data Mining technology is based on the concept of patterns (regularities). As a result of discovering these regularities, hidden from the naked eye, Data Mining problems are solved. Different types of patterns that can be expressed in a human-readable form correspond to particular Data Mining tasks.

There is no consensus on what tasks should be attributed to Data Mining. Most authoritative sources list the following: classification, clustering, prediction, association, visualization, analysis and discovery of deviations, estimation, analysis of relationships, and summarization.

The purpose of the description that follows is to give an overview of the problems of DataMining, to compare some of them, and also to present some of the methods by which these problems are solved. The most common DataMining tasks are classification, clustering, association, prediction, and visualization. Thus, tasks are divided according to the types of information produced, this is the most general classification of DataMining tasks.

Classification

Classification is the task of dividing a set of objects or observations into a priori given groups called classes, within each of which the objects are assumed to be similar to each other, having approximately the same properties and features. The solution is obtained on the basis of analyzing the values of the attributes (features).

Classification is one of the most important Data Mining tasks. It is applied in marketing when assessing the creditworthiness of borrowers, determining customer loyalty, in pattern recognition, medical diagnostics, and many other applications. If the analyst knows the properties of the objects of each class, then when a new observation is assigned to a certain class, these properties automatically apply to it.

If the number of classes is limited to two, the task becomes binary classification, to which many more complex problems can be reduced. For example, instead of defining credit risk degrees such as "High", "Medium" or "Low", you can use only two: "Issue" or "Refuse".

Many different models are used for classification in Data Mining: neural networks, decision trees, support vector machines, k-nearest neighbors, covering algorithms, etc. They are constructed using supervised learning, when the output variable (class label) is given for each observation. Formally, classification is based on partitioning the feature space into regions, within each of which the multidimensional vectors are considered identical. In other words, if an object falls into a region of space associated with a certain class, it belongs to that class.
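As a minimal illustration of such a partition of the feature space, here is a k-nearest-neighbors sketch for the binary "Issue"/"Refuse" example above; the features, data and value of k are purely hypothetical.

```python
from collections import Counter
import math

def knn_classify(new_point, training_data, k=3):
    """Assign new_point to the class most common among its k nearest neighbors."""
    # training_data: list of (feature_vector, class_label) pairs
    by_distance = sorted(training_data,
                         key=lambda item: math.dist(new_point, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# hypothetical training set: (income, age) -> credit decision
training_data = [
    ((80, 45), "Issue"), ((65, 38), "Issue"), ((70, 50), "Issue"),
    ((20, 22), "Refuse"), ((30, 25), "Refuse"), ((25, 40), "Refuse"),
]
print(knn_classify((60, 35), training_data))  # expected: "Issue"
```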

Clustering

Short description. Clustering is a logical continuation of the idea of classification. This task is more complicated; the peculiarity of clustering is that the classes of objects are not predetermined. The result of clustering is the division of objects into groups.

An example of a method for solving a clustering problem: training "without a teacher" of a special type of neural networks - Kohonen's self-organizing maps.

Association (Associations)

Short description. In the course of solving the problem of searching for association rules, patterns are found between related events in a data set.

The difference between association and the two previous Data Mining tasks is that the patterns are sought not on the basis of the properties of a single analyzed object but between several events that occur simultaneously. The most well-known algorithm for finding association rules is the Apriori algorithm.
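As a small sketch of the idea (not the full Apriori algorithm), the support and confidence of pairwise rules can be counted over hypothetical market baskets:

```python
from itertools import combinations
from collections import Counter

# hypothetical transactions (market baskets)
baskets = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"bread", "milk"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
for (x, y), count in pair_counts.items():
    support = count / n                  # share of baskets containing both X and Y
    confidence = count / item_counts[x]  # estimate of P(Y in basket | X in basket)
    if support >= 0.4 and confidence >= 0.6:
        print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

Real implementations (Apriori and its successors) prune the search over larger itemsets using the fact that any subset of a frequent itemset must itself be frequent.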

Sequence or sequential association

Short description. The sequence allows you to find temporal patterns between transactions. The task of a sequence is similar to an association, but its goal is to establish patterns not between simultaneously occurring events, but between events connected in time (i.e., occurring at some specific interval in time). In other words, the sequence is determined by the high probability of a chain of events related in time. In fact, an association is a special case of a sequence with zero time lag. This DataMining problem is also called the sequential pattern problem.

Sequence rule: after event X, event Y will occur after a certain time.

Example. After buying an apartment, tenants in 60% of cases purchase a refrigerator within two weeks, and within two months, in 50% of cases, a TV is purchased. The solution to this problem is widely used in marketing and management, for example, in managing the customer lifecycle (CustomerLifecycleManagement).
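A small sketch of checking such a rule on hypothetical timestamped events (days counted from the first purchase):

```python
# hypothetical event log: customer -> list of (day, event)
events = {
    "c1": [(0, "apartment"), (5, "refrigerator"), (40, "tv")],
    "c2": [(0, "apartment"), (20, "tv")],
    "c3": [(0, "apartment"), (10, "refrigerator")],
}

def rule_holds(history, x, y, within_days):
    """True if event y occurs no later than within_days after event x."""
    x_days = [d for d, e in history if e == x]
    y_days = [d for d, e in history if e == y]
    return any(0 < dy - dx <= within_days for dx in x_days for dy in y_days)

matches = sum(rule_holds(h, "apartment", "refrigerator", 14) for h in events.values())
print(f"apartment -> refrigerator within two weeks: {matches / len(events):.0%}")
```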

Regression, forecasting (Forecasting)

Short description. As a result of solving the forecasting problem, the missing or future values of target numerical indicators are estimated based on the features of the historical data.

To solve such problems, methods of mathematical statistics, neural networks, etc. are widely used.
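For instance, a least-squares trend fit over a hypothetical sales history is a minimal forecasting baseline of this kind:

```python
import numpy as np

# hypothetical monthly sales history
sales = np.array([120.0, 132.0, 128.0, 145.0, 151.0, 160.0])
months = np.arange(len(sales))

# fit a linear trend y = a * month + b by least squares
a, b = np.polyfit(months, sales, deg=1)

next_month = len(sales)
forecast = a * next_month + b
print(f"forecast for month {next_month}: {forecast:.1f}")
```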

Additional tasks

Determining deviations or outliers (Deviation Detection), deviation or outlier analysis

Short description. The purpose of solving this problem is the detection and analysis of data that differs most from the general set of data, the identification of so-called uncharacteristic patterns.

Estimation

The task of estimation is reduced to predicting continuous values ​​of a feature.

Link analysis (LinkAnalysis)

The task of finding dependencies in a data set.

Visualization (Visualization, GraphMining)

As a result of visualization, a graphic image of the analyzed data is created. To solve the visualization problem, graphical methods are used to show the presence of patterns in the data.

An example of visualization techniques is the presentation of data in 2-D and 3-D dimensions.

Summarization

The task, the purpose of which is the description of specific groups of objects from the analyzed data set.

Quite close to the above classification is the division of DataMining tasks into the following: research and discovery, prediction and classification, explanation and description.

Automatic research and discovery (free search)

Task example: discovery of new market segments.

To solve this class of problems, methods of cluster analysis are used.

Prediction and classification

Sample problem: predict sales growth based on current values.

Methods: regression, neural networks, genetic algorithms, decision trees.

The tasks of classification and forecasting constitute a group of so-called inductive modeling, which results in the study of the analyzed object or system. In the process of solving these problems, a general model or hypothesis is developed on the basis of a data set.

Explanation and description

Sample problem: characterizing customers by demographics and purchase histories.

Methods: decision trees, rule systems, association rules, link analysis.

An example of such a description is the rule: if the client's income is more than 50 conventional units and his age is more than 30 years, then the client belongs to the first class.
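Written as code, this hypothetical rule is simply a conditional check (the default class used in the "else" branch is assumed here for illustration):

```python
def client_class(income, age):
    # hypothetical rule: income > 50 conventional units and age > 30 -> class 1
    if income > 50 and age > 30:
        return 1
    return 2  # assumed default class for all other clients

print(client_class(income=60, age=35))  # 1
```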

Comparison of clustering and classification

  • Controllability of learning: classification is controlled, clustering is uncontrolled.
  • Strategy: classification uses learning with a teacher (supervised learning); clustering uses learning without a teacher (unsupervised learning).
  • Presence of a class label: in classification, each observation of the training set is accompanied by a label indicating the class to which it belongs; in clustering, the class labels of the training set are unknown.
  • Basis for classification: in classification, new data is classified on the basis of the training set; in clustering, a set of data is given with the purpose of establishing the existence of classes or clusters in it.

Scopes of DataMining

It should be noted that today DataMining technology is most widely used in solving business problems. Perhaps the reason is that it is in this direction that the return on using DataMining tools can be, according to some sources, up to 1000%, and the costs of its implementation can quickly pay off.

We will look at the four main applications of DataMining technology in detail: science, business, government research, and the Web.

Business tasks. Main areas: banking, finance, insurance, CRM, manufacturing, telecommunications, e-commerce, marketing, the stock market and others.

  • whether to issue a loan to a client;
  • market segmentation;
  • attraction of new clients;
  • credit card fraud.

Application of Data Mining to solving state-level problems. Main directions: searching for tax evaders; tools in the fight against terrorism.

Application of DataMining for scientific research. Main areas: medicine, biology, molecular genetics and genetic engineering, bioinformatics, astronomy, applied chemistry, drug addiction research, and others.

Application of Data Mining to solving Web tasks. Main directions: search engines, counters and others.

Electronic commerce

In the field of e-commerce, Data Mining is used to classify visitors and customers. This classification allows companies to identify specific groups of customers and conduct marketing policies in accordance with the identified interests and needs of customers. Data Mining technology for e-commerce is closely related to WebMining technology.

The main tasks of DataMining in industrial production:

  • complex system analysis of production situations;
  • short-term and long-term forecasting of the development of production situations;
  • development of options for optimization solutions;
  • predicting the quality of a product depending on certain parameters of the technological process;
  • detection of hidden trends and patterns in the development of production processes;
  • forecasting patterns in the development of production processes;
  • detection of hidden factors of influence;
  • detection and identification of previously unknown relationships between production parameters and factors of influence;
  • analysis of the environment of interaction of production processes and forecasting changes in its characteristics;
  • visualization of analysis results, preparation of preliminary reports and drafts of feasible solutions with estimates of the reliability and efficiency of possible implementations.

Marketing

In the field of marketing, DataMining is widely used.

The basic marketing questions are "What is sold?", "How is it sold?", "Who is the consumer?"

In the lecture on classification and clustering problems, the use of cluster analysis for solving marketing problems, such as consumer segmentation, is described in detail.

Another common set of methods for solving marketing problems are methods and algorithms for searching for association rules.

The search for temporal patterns is also successfully used here.

Retail

In retail, as in marketing, the following are applied:

  • algorithms for searching for association rules (to determine frequently occurring sets of goods that buyers purchase at the same time); identifying such rules helps place goods on the shelves of trading floors, develop strategies for purchasing goods and for placing them in warehouses, etc.;
  • time sequences, for example, to determine the required amount of inventory in the warehouse;
  • classification and clustering methods to identify groups or categories of customers, knowledge of which contributes to the successful promotion of goods.

Stock market

Here is a list of stock market problems that can be solved using Data Mining technology:

  • forecasting future values of financial instruments and indicators from their past values;
  • forecasting the trend (future direction of movement: growth, fall, flat) of a financial instrument and its strength (strong, moderately strong, etc.);
  • identifying the cluster structure of the market, industry or sector according to a certain set of characteristics;
  • dynamic portfolio management;
  • volatility forecasting;
  • risk assessment;
  • predicting the onset of a crisis and forecasting its development;
  • asset selection, etc.

In addition to the areas of activity described above, DataMining technology can be applied in a wide variety of business areas where there is a need for data analysis and a certain amount of retrospective information has been accumulated.

Application of DataMining in CRM

One of the most promising applications of DataMining is the use of this technology in analytical CRM.

CRM (Customer Relationship Management) - customer relationship management.

When these technologies are used together, knowledge mining is combined with "money mining" from customer data.

An important aspect of the work of the marketing and sales departments is the preparation of a holistic view of customers: information about their features and characteristics and the structure of the customer base. CRM uses so-called customer profiling, which gives a complete view of all the necessary information about customers.

Customer profiling includes the following components: customer segmentation, customer profitability, customer retention, customer response analysis. Each of these components can be explored using DataMining, and analyzing them together as profiling components can result in knowledge that cannot be obtained from each individual characteristic.

WebMining

WebMining can be translated as "data mining on the Web". Web Intelligence is ready to "open a new chapter" in the rapid development of e-business. The ability to determine the interests and preferences of each visitor by observing their behavior is a serious and critical competitive advantage in the e-commerce market.

WebMining systems can answer many questions, for example, which of the visitors is a potential client of the Web store, which group of customers of the Web store brings the most income, what are the interests of a particular visitor or group of visitors.

Methods

Classification of methods

There are two groups of methods:

  • statistical methods based on the use of average accumulated experience, which is reflected in retrospective data;
  • cybernetic methods, including many heterogeneous mathematical approaches.

The disadvantage of such a classification is that both statistical and cybernetic algorithms rely in one way or another on the comparison of statistical experience with the results of monitoring the current situation.

The advantage of such a classification is its convenience for interpretation - it is used in describing the mathematical tools of the modern approach to extracting knowledge from arrays of initial observations (operational and retrospective), i.e. in Data Mining tasks.

Let's take a closer look at the above groups.

Statistical Data Mining methods

These methods comprise four interrelated sections:

  • preliminary analysis of the nature of the statistical data (testing hypotheses of stationarity, normality, independence, homogeneity, estimating the type of the distribution function and its parameters, etc.);
  • identifying links and patterns (linear and non-linear regression analysis, correlation analysis, etc.);
  • multivariate statistical analysis (linear and non-linear discriminant analysis, cluster analysis, component analysis, factor analysis, etc.);
  • dynamic models and forecasting based on time series.

The arsenal of statistical Data Mining methods is classified into four groups of methods:

  1. Descriptive analysis and description of initial data.
  2. Relationship analysis (correlation and regression analysis, factor analysis, analysis of variance).
  3. Multivariate statistical analysis (component analysis, discriminant analysis, multivariate regression analysis, canonical correlations, etc.).
  4. Time series analysis (dynamic models and forecasting).

Cybernetic Data Mining Methods

The second direction of Data Mining is a set of approaches united by the idea of ​​computer mathematics and the use of artificial intelligence theory.

This group includes the following methods:

  • artificial neural networks (recognition, clustering, forecast);
  • evolutionary programming (including algorithms of the method of group accounting of arguments);
  • genetic algorithms (optimization);
  • associative memory (search for analogues, prototypes);
  • fuzzy logic;
  • decision trees;
  • expert knowledge processing systems.

cluster analysis

The purpose of clustering is to search for existing structures.

Clustering is a descriptive procedure, it does not draw any statistical conclusions, but it provides an opportunity to conduct exploratory analysis and study the "structure of the data".

The very concept of a "cluster" is defined ambiguously: each study has its own "clusters". A cluster can be described as a group of objects that have common properties.

There are two characteristics of a cluster:

  • internal homogeneity;
  • external isolation.

A question that analysts ask in many problems is how to organize the data into visual structures, i.e. how to build taxonomies.

Initially, clustering was most widely used in such sciences as biology, anthropology, and psychology. For a long time, clustering has been little used to solve economic problems due to the specifics of economic data and phenomena.

Clusters can be non-overlapping, or exclusive (non-overlapping, exclusive), and intersecting (overlapping).

It should be noted that as a result of applying various methods of cluster analysis, clusters of various shapes can be obtained. For example, clusters of a "chain" type are possible, when clusters are represented by long "chains", elongated clusters, etc., and some methods can create arbitrary-shaped clusters.

Various methods may aim to create clusters of certain sizes (eg small or large) or assume clusters of different sizes in the data set. Some cluster analysis methods are particularly sensitive to noise or outliers, while others are less so. As a result of applying different clustering methods, different results can be obtained, this is normal and is a feature of the operation of a particular algorithm. These features should be taken into account when choosing a clustering method.

Let us give a brief description of approaches to clustering.

Algorithms based on data partitioning (partitioning algorithms), including iterative ones:

  • division of objects into k clusters;
  • iterative redistribution of objects to improve the clustering.

Hierarchical algorithms (hierarchy algorithms):

  • agglomerative: each object is initially a separate cluster; clusters are then merged with one another to form ever larger clusters, and so on.

Density-based methods:

  • based on the connectivity of objects;
  • ignore noise and find clusters of arbitrary shape.

Grid-based methods:

  • quantization of objects into grid structures.

Model-based methods:

  • use a model to find the clusters that best fit the data.

Methods of cluster analysis. Iterative methods.

With a large number of observations, hierarchical methods of cluster analysis are not suitable. In such cases, non-hierarchical methods based on division are used, which are iterative methods of splitting the original population. During the division process, new clusters are formed until the stopping rule is met.

Such non-hierarchical clustering consists in dividing a data set into a certain number of distinct clusters. There are two approaches. The first is to define the boundaries of clusters as the densest areas in the multidimensional space of the initial data, i.e. to define a cluster where there is a large "concentration of points". The second approach is to minimize the measure of difference between objects.

Algorithm k-means (k-means)

The most common among non-hierarchical methods is the k-means algorithm, also called fast cluster analysis. A complete description of the algorithm can be found in Hartigan and Wong (1979). Unlike hierarchical methods, which require no preliminary assumptions about the number of clusters, this method requires a hypothesis about the most probable number of clusters.

The k-means algorithm builds k clusters spaced as far apart as possible. The main type of problem that k-means solves presupposes assumptions (hypotheses) regarding the number of clusters, which should be as distinct as possible. The choice of the number k may be based on previous research, theoretical considerations, or intuition.

The general idea of the algorithm: a fixed number k of clusters is given, and observations are assigned to clusters so that the means in each cluster (over all variables) differ from one another as much as possible.

Description of the algorithm

1. Initial distribution of objects by clusters.

  • A number k is chosen, and k points are selected which at the first step are considered to be the "centers" of the clusters.
  • Each cluster corresponds to one center.

The choice of initial centroids can be carried out as follows:

  • choosing k-observations to maximize the initial distance;
  • random selection of k-observations;
  • choice of the first k-observations.

As a result, each object is assigned to a specific cluster.

2. Iterative process.

The centers of the clusters are calculated; these are thereafter taken to be the coordinate means of the clusters. The objects are then redistributed again.

The process of calculating centers and redistributing objects continues until one of the following conditions is met:

  • cluster centers have stabilized, i.e. all observations belong to the cluster they belonged to before the current iteration;
  • the number of iterations is equal to the maximum number of iterations.

The figure shows an example of the operation of the k-means algorithm for k equal to two.

An example of the k-means algorithm (k=2)
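A minimal sketch of this procedure (assuming NumPy, random initial centers chosen among the observations, and toy two-dimensional data):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """A minimal k-means sketch; X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # random selection of k observations as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # assign every object to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # stop when cluster memberships have stabilized
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # recompute each center as the coordinate mean of its cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

# toy usage: two well-separated groups of points
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = k_means(X, k=2)
```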

The choice of the number of clusters is a complex issue. If there are no assumptions about this number, it is recommended to create 2 clusters, then 3, 4, 5, etc., comparing the results.

Checking the quality of clustering

After obtaining the results of cluster analysis using the k-means method, one should check the correctness of the clustering (i.e., evaluate how the clusters differ from each other).

To do this, average values ​​for each cluster are calculated. Good clustering should produce very different means for all measurements, or at least most of them.

Advantages of the k-means algorithm:

  • ease of use;
  • speed of use;
  • clarity and transparency of the algorithm.

Disadvantages of the k-means algorithm:

  • the algorithm is too sensitive to outliers that can distort the mean.

A possible solution to this problem is to use a modification of the algorithm - the k-median algorithm;

  • the algorithm can be slow on large databases. A possible solution to this problem is to use data sampling.

Bayesian networks

In probability theory, the concept of information dependency is modeled by conditional dependency (or strictly: lack of conditional independence), which describes how our confidence in the outcome of some event changes when we gain new knowledge about the facts, provided that we already knew some set of other facts.

It is convenient and intuitive to represent dependencies between elements by means of a directed path connecting these elements in a graph. If the relationship between elements x and y is not direct but is carried out through a third element z, it is logical to expect an element z on the path between x and y. Such intermediary nodes "cut off" the dependence between x and y, i.e. they model a situation of conditional independence between them given known values of the direct factors of influence. Bayesian networks are such a modeling language; they serve to describe conditional dependencies between the concepts of a certain subject area.

Bayesian networks are graphical structures for representing probabilistic relationships between a large number of variables and for performing probabilistic inference based on those variables. "Naive" (Bayesian) classification is a fairly transparent and understandable classification method. It is called "naive" because it proceeds from the assumption of the mutual independence of the features.

Classification properties:

1. Using all variables and defining all dependencies between them.

2. Having two assumptions about variables:

  • all variables are equally important;
  • all variables are statistically independent, i.e. the value of one variable says nothing about the value of another.
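Under these assumptions, a categorical Naive Bayes classifier reduces to counting value frequencies per class and multiplying the resulting conditional probabilities. A minimal sketch on hypothetical toy data (the add-one smoothing denominator assumes two possible values per feature, a simplification):

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(int)  # (feature index, value, class) -> count
    for row, cls in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, value, cls)] += 1
    return class_counts, value_counts

def predict(row, class_counts, value_counts):
    """Pick the class with the largest product of conditional probabilities,
    assuming the features are mutually independent given the class."""
    total = sum(class_counts.values())
    best_cls, best_score = None, 0.0
    for cls, n_cls in class_counts.items():
        score = n_cls / total
        for i, value in enumerate(row):
            # add-one smoothing so that unseen values do not zero the product
            score *= (value_counts[(i, value, cls)] + 1) / (n_cls + 2)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# hypothetical toy data: (income level, age group) -> credit decision
rows = [("high", "30+"), ("high", "under30"), ("low", "30+"), ("low", "under30")]
labels = ["issue", "issue", "refuse", "refuse"]
model = train_naive_bayes(rows, labels)
print(predict(("high", "30+"), *model))  # expected: "issue"
```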

There are two main scenarios for using Bayesian networks:

1. Descriptive analysis. The subject area is displayed as a graph, the nodes of which represent concepts, and the directed arcs displayed by arrows illustrate the direct relationships between these concepts. The relationship between x and y means that knowing the value of x helps you make a better guess about the value of y. The absence of a direct connection between concepts models the conditional independence between them, given the known values ​​of a certain set of "separating" concepts. For example, a child's shoe size is obviously related to a child's ability to read through age. Thus, a larger shoe size gives more confidence that the child is already reading, but if we already know the age, then knowing the shoe size will no longer give us additional information about the child's ability to read.


As another, opposite, example, consider such initially unrelated factors as smoking and a cold. But if we know a symptom, for example, that a person suffers from a morning cough, then knowing that a person does not smoke increases our confidence that a person has a cold.

2. Classification and forecasting. The Bayesian network, allowing for the conditional independence of a number of concepts, makes it possible to reduce the number of joint distribution parameters, making it possible to estimate them confidently on the available data volumes. So, with 10 variables, each of which can take 10 values, the number of joint distribution parameters is 10 billion - 1. If we assume that only 2 variables depend on each other between these variables, then the number of parameters becomes 8 * (10-1) + (10 * 10-1) = 171. Having a model of joint distribution that is realistic in terms of computational resources, we can predict the unknown value of a concept as, for example, the most probable value of this concept with known values ​​of other concepts.

The following advantages of Bayesian networks as a Data Mining method are noted:

  • dependencies between all variables are defined in the model, which makes it easy to handle situations in which the values of some variables are unknown;
  • Bayesian networks are quite easy to interpret and, at the predictive modeling stage, make it easy to carry out "what if" scenario analysis;
  • the Bayesian approach makes it possible to naturally combine patterns derived from the data with, for example, expert knowledge obtained explicitly;
  • the use of Bayesian networks avoids the problem of overfitting, that is, the excessive complication of the model, which is a weakness of many methods (for example, decision trees and neural networks).

The Naive Bayesian approach has the following disadvantages:

  • multiplying conditional probabilities is correct only when all input variables are indeed statistically independent; although this method often shows fairly good results even when the condition of statistical independence is violated, in theory such situations should be handled by more complex methods based on training Bayesian networks;
  • direct processing of continuous variables is impossible; they have to be converted to an interval scale so that the attributes are discrete, and such transformations can sometimes lead to the loss of meaningful patterns;
  • the classification result in the Naive Bayesian approach is affected only by the individual values of the input variables; the combined influence of pairs or triples of values of different attributes is not taken into account. Taking it into account could improve the quality of the classification model in terms of its predictive accuracy, but would increase the number of variants to be tested.

Artificial neural networks

Artificial neural networks (hereinafter, neural networks) can be synchronous or asynchronous. In synchronous neural networks, only one neuron changes its state at each moment of time; in asynchronous networks, the state changes at once for a whole group of neurons, as a rule for an entire layer. Two basic architectures can be distinguished: layered and fully connected networks.

The key concept in layered networks is that of a layer. A layer is one or more neurons whose inputs receive the same common signal. Layered neural networks are networks in which the neurons are divided into separate groups (layers) so that information is processed layer by layer. In layered networks, the neurons of the i-th layer receive input signals, transform them, and pass them through branch points to the neurons of the (i+1)-th layer, and so on up to the k-th layer, which produces the output signals for the interpreter and the user. The number of neurons in each layer is not related to the number of neurons in other layers and can be arbitrary. Within one layer, data is processed in parallel, while across the network as a whole processing is carried out sequentially, from layer to layer. Layered neural networks include, for example, multilayer perceptrons, radial basis function networks, the cognitron, the neocognitron, and associative memory networks. However, the signal is not always fed to all neurons of a layer; in the cognitron, for example, each neuron of the current layer receives signals only from the neurons close to it in the previous layer.

Layered networks, in turn, can be single-layer and multi-layer.

A single-layer network is a network consisting of one layer.

A multilayer network is a network with several layers.

In a multilayer network, the first layer is called the input layer, the subsequent layers are called internal or hidden, and the last layer is the output layer. Thus, the intermediate layers are all the layers of a multilayer neural network except the input and output ones. The input layer of the network implements the connection with the input data, the output layer with the output data. Thus, neurons can be input, output and hidden. The input layer is organized from input neurons that receive data and distribute it to the inputs of the neurons in the hidden layer of the network. A hidden neuron is a neuron located in a hidden layer of the neural network. The output neurons, from which the output layer of the network is organized, produce the results of the neural network.

In fully connected networks each neuron transmits its output signal to the rest of the neurons, including itself. The output signals of the network can be all or some of the output signals of neurons after several clock cycles of the network.

All input signals are fed to all neurons.

Training of neural networks

Before a neural network can be used, it must be trained. The process of training a neural network consists in adjusting its internal parameters for a specific task. The neural network algorithm is iterative; its steps are called epochs or cycles. An epoch is one iteration of the learning process, including the presentation of all examples from the training set and, possibly, checking the quality of training on a control set. The training process is carried out on the training set. The training set includes the input values and the corresponding output values from the dataset. In the course of training, the neural network finds certain dependencies of the output fields on the input ones. Thus, we are faced with the question of which input fields (features) need to be used. Initially, the choice is made heuristically; later the number of inputs can be changed.

The number of observations in the dataset can also raise difficulties. Although there are some rules describing the relationship between the required number of observations and the size of the network, their correctness has not been proven. The number of necessary observations depends on the complexity of the problem being solved. As the number of features grows, the number of observations needed increases non-linearly; this problem is called the "curse of dimensionality". If there is not enough data, it is recommended to use a linear model.

The analyst must determine the number of layers in the network and the number of neurons in each layer. Next, values of the weights and biases must be assigned that minimize the decision error. The weights and biases are adjusted automatically so as to minimize the difference between the desired and actual output signals, which is called the training error. The training error for the constructed neural network is calculated by comparing the output and target (desired) values. The error function is formed from the differences obtained.

The error function is an objective function that must be minimized in the process of supervised learning of the neural network. Using the error function, one can evaluate the quality of the neural network during training; for example, the sum of squared errors is often used. The ability of the network to solve the assigned tasks depends on the quality of its training.
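A minimal sketch of this adjustment for a single-layer network with one output, trained by gradient descent to reduce the sum of squared errors (the data is synthetic and the learning rate and epoch count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 observations, 3 input features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3   # synthetic target values

w = np.zeros(3)   # weights
b = 0.0           # bias
lr = 0.05         # learning rate

for epoch in range(200):      # one epoch = one pass over the training set
    out = X @ w + b           # network output
    err = out - y             # difference between output and target values
    sse = float(err @ err)    # sum of squared errors (the objective function)
    # gradient step on the weights and the bias
    w -= lr * (X.T @ err) / len(X)
    b -= lr * err.mean()

print("final training error:", sse)
```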

Neural network overfitting

When training neural networks, a serious difficulty often arises, called the overfitting problem. Overfitting is the excessive fitting of the neural network to a specific set of training examples, in which the network loses its ability to generalize. Overfitting occurs when training goes on too long, there are not enough training examples, or the neural network structure is overly complex. Overfitting is related to the fact that the choice of the training set is random. From the first steps of training, the error decreases; at subsequent steps, in order to reduce the error (the objective function), the parameters are adjusted to the peculiarities of the training set. However, this "adjustment" happens not to the general patterns of the series but to the peculiarities of part of it, the training subset, and the accuracy of the forecast then decreases. One way to combat network overfitting is to divide the training sample into two sets (training and test). The neural network is trained on the training set, and the constructed model is checked on the test set; these sets must not overlap. At each step the parameters of the model change, but the steady decrease in the value of the objective function occurs precisely on the training set. By splitting the data in two, we can observe the change in the forecast error on the test set in parallel with the observations on the training set. For some number of steps, the prediction error decreases on both sets. However, at a certain step, the error on the test set begins to grow while the error on the training set continues to decrease. This moment is considered the beginning of overfitting.
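A minimal sketch of monitoring both errors to pick a stopping point, reusing the same kind of gradient-descent setup on a non-overlapping training/test split (the data is synthetic; with such a simple linear model the effect is mild, so the point here is the monitoring pattern rather than a dramatic overfitting curve):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)   # noisy synthetic target

# non-overlapping training and test sets
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

w, b, lr = np.zeros(5), 0.0, 0.05
history = []                                    # (train error, test error) per epoch

for epoch in range(300):
    err = X_train @ w + b - y_train
    w -= lr * X_train.T @ err / len(X_train)    # gradient step on the training set only
    b -= lr * err.mean()
    train_mse = float(np.mean(err ** 2))
    test_mse = float(np.mean((X_test @ w + b - y_test) ** 2))
    history.append((train_mse, test_mse))

# the epoch with the smallest test error is the natural stopping point;
# training past it adjusts the model to the peculiarities of the training subset
best_epoch = min(range(len(history)), key=lambda e: history[e][1])
print("stop training at epoch", best_epoch)
```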

Data mining tools

Both world-famous leaders and new emerging companies are engaged in developing the Data Mining sector of the world software market. Data Mining tools can be offered either as a standalone application or as add-ons to a main product. The latter option is chosen by many software market leaders. Thus, it has already become a tradition for the developers of universal statistical packages to include, in addition to traditional statistical analysis methods, a certain set of Data Mining methods in the package. Examples are packages such as SPSS (SPSS, Clementine), Statistica (StatSoft), and SAS Institute (SAS Enterprise Miner). Some OLAP vendors also offer a set of Data Mining techniques, for example the Cognos family of products. There are vendors that include Data Mining solutions in the functionality of a DBMS: Microsoft (Microsoft SQL Server), Oracle, IBM (IBM Intelligent Miner for Data).

Bibliography

  1. Abdikeev N.M., Danko T.P., Ildemenov S.V., Kiselev A.D. Reengineering of Business Processes. MBA Course. Moscow: Eksmo, 2005. 592 p.
  2. Abdikeev N.M., Kiselev A.D. Knowledge Management in Corporations and Business Reengineering. Moscow: Infra-M, 2011. 382 p. ISBN 978-5-16-004300-5
  3. Barseghyan A.A., Kupriyanov M.S., Stepanenko V.V., Holod I.I. Methods and Models of Data Analysis: OLAP and Data Mining. St. Petersburg: BHV-Petersburg, 2004. 336 p. ISBN 5-94157-522-X
  4. Duke V., Samoilenko A. Data Mining: Training Course. St. Petersburg: Piter, 2001. 386 p.
  5. Chubukova I.A. Data Mining course, http://www.intuit.ru/department/database/datamining/
  6. Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Morgan Kaufmann. ISBN 978-0-12-374856-0
  7. Petrushin V.A., Khan L. Multimedia Data Mining and Knowledge Discovery

Data Mining is divided into two large groups according to the principle of working with the initial training data. In this classification, the top level is determined based on whether data is stored after Data Mining or whether it is distilled for later use.

1. Direct use of the data, or saving data.

In this case, the initial data is stored in explicit, detailed form and is used directly at the stages of predictive modeling and/or exception analysis. The problem with this group of methods is that analyzing very large databases with them may be difficult.

Methods of this group: cluster analysis, nearest neighbor method, k-nearest neighbor method, reasoning by analogy.

2. Identification and use of formalized patterns, or template distillation.

With pattern distillation technology, a sample (template) of information is extracted from the source data and converted into formal constructions whose form depends on the Data Mining method used. This process is carried out at the free search stage; the first group of methods lacks this stage in principle. At the predictive modeling and exception analysis stages, the results of the free search stage are used; they are much more compact than the databases themselves. Recall that the constructions of these models can be either interpretable by the analyst or non-interpretable ("black boxes").

Methods of this group: logical methods; visualization methods; cross-tabulation methods; methods based on equations.

Logical methods, or methods of logical induction, include: fuzzy queries and analyses; symbolic rules; decision trees; genetic algorithms.

The methods of this group are perhaps the most interpretable: in most cases they express the patterns found in a form that is fairly transparent from the user's point of view. The resulting rules may include continuous and discrete variables. It should be noted that decision trees can easily be converted into sets of symbolic rules by generating one rule along the path from the root of the tree to each terminal vertex. Decision trees and rules are actually different ways of solving the same problem and differ only in their capabilities. Moreover, rule induction is implemented by slower algorithms than decision tree induction.

Cross-tabulation methods: agents, Bayesian (belief) networks, cross-tabulation visualization. The last method does not quite correspond to one of the properties of Data Mining, the independent search for patterns by the analytical system. However, presenting information in the form of cross-tabs serves the main task of Data Mining, the search for patterns, so this method can also be considered one of the Data Mining methods.

Methods based on equations.

The methods of this group express the revealed patterns in the form of mathematical expressions - equations. Therefore, they can only work with numeric variables, and variables of other types must be encoded accordingly. This somewhat limits the application of the methods of this group; nevertheless, they are widely used in solving various problems, especially forecasting problems.

The main methods of this group: statistical methods and neural networks

Statistical methods are most often used to solve forecasting problems. There are many methods of statistical data analysis, among them, for example, correlation and regression analysis, correlation of time series, identification of trends in time series, harmonic analysis.

Another classification divides the whole variety of Data Mining methods into two groups: statistical and cybernetic methods. This separation scheme is based on various approaches to teaching mathematical models.

It should be noted that there are two approaches to classifying statistical methods as Data Mining. The first of them contrasts statistical methods and Data Mining, its supporters consider classical statistical methods to be a separate area of ​​data analysis. According to the second approach, statistical analysis methods are part of the Data Mining mathematical toolkit. Most authoritative sources take the second approach.

In this classification, the same two groups are distinguished: statistical methods, based on the use of averaged accumulated experience reflected in retrospective data, and cybernetic methods, which unite many heterogeneous mathematical approaches (artificial neural networks, genetic algorithms, evolutionary programming, associative memory, fuzzy logic). Both groups were characterized in detail above. Data Mining toolkits often also include classical statistical methods (descriptive analysis, correlation and regression analysis, factor analysis, analysis of variance, component analysis, discriminant analysis, time series analysis). Such methods, however, require some a priori assumptions about the data being analyzed, which is somewhat at odds with the goal of data mining: the discovery of previously unknown, non-trivial and practically useful knowledge.

One of the most important purposes of Data Mining methods is a visual representation of the results of calculations, which allows the use of Data Mining tools by people who do not have special mathematical training. At the same time, the use of statistical methods for data analysis requires a good command of probability theory and mathematical statistics.

Introduction

Data mining methods (or, equivalently, Knowledge Discovery In Data, for short, KDD) lie at the intersection of databases, statistics and artificial intelligence.

Historical digression

The field of Data Mining began with a workshop held by Gregory Piatetsky-Shapiro in 1989.

Earlier, while working at GTE Labs, Gregory Piatetsky-Shapiro became interested in the question of whether certain rules could be found automatically in order to speed up some queries to large databases. At the same time, two terms were proposed: Data Mining ("data mining") and Knowledge Discovery In Data (which should be translated as "knowledge discovery in databases").

Formulation of the problem

Initially, the task is set as follows:

  • there is a fairly large database;
  • it is assumed that there is some "hidden knowledge" in the database.

It is necessary to develop methods for discovering knowledge hidden in large volumes of initial "raw" data.

What does "hidden knowledge" mean? It must be knowledge of:

  • previously unknown - that is, such knowledge that should be new (and not confirming any previously received information);
  • non-trivial - that is, those that cannot be simply seen (with direct visual analysis of data or when calculating simple statistical characteristics);
  • practically useful - that is, such knowledge that is of value to the researcher or consumer;
  • accessible for interpretation - that is, such knowledge that is easy to present in a visual form for the user and easy to explain in terms of the subject area.

These requirements largely determine the essence of Data mining methods and how and in what proportion Data mining technology uses database management systems, statistical methods of analysis and artificial intelligence methods.

Data mining and databases

Data mining methods make sense to apply only for sufficiently large databases. Each specific area of ​​research has its own criterion for the "greatness" of the database.

The development of database technologies first led to the creation of a specialized language, the database query language. For relational databases this is SQL, which provides ample opportunities for creating, modifying and retrieving stored data. Then a need arose for obtaining analytical information (for example, information about the activities of an enterprise over a certain period), and here it turned out that traditional relational databases, well suited, for example, to keeping operational records at an enterprise, are poorly suited to analysis. This led, in turn, to the creation of so-called "data warehouses", whose very structure best corresponds to comprehensive mathematical analysis.

Data mining and statistics

Data mining methods are based on mathematical methods of data processing, including statistical ones. In industrial solutions such methods are often included directly in data mining packages. However, it should be borne in mind that, first, researchers often unjustifiably use parametric tests instead of non-parametric ones for the sake of simplicity, and second, the results of the analysis are difficult to interpret, which is completely at odds with the goals and objectives of Data mining. Statistical methods are nevertheless used, but their application is limited to certain stages of the study.

Data mining and artificial intelligence

Knowledge obtained by Data mining methods is usually represented as models. These models are:

  • association rules;
  • decision trees;
  • clusters;
  • mathematical functions.

Methods for constructing such models are usually referred to the field of so-called "artificial intelligence".

Tasks

The tasks solved by Data Mining methods are usually divided into descriptive (eng. descriptive) and predictive (eng. predictive).

In descriptive tasks, the most important thing is to give a visual description of the existing hidden patterns, while in predictive tasks, the question of prediction for those cases for which there are no data is in the forefront.

Descriptive tasks include:

  • search for association rules or patterns (samples);
  • grouping of objects, cluster analysis;
  • building a regression model.

Predictive tasks include:

  • classification of objects (for predefined classes);
  • regression analysis, time series analysis.

Learning algorithms

Classification problems are characterized by “supervised learning”, in which the construction (training) of the model is performed on a sample containing input and output vectors.

For clustering and association problems, “unsupervised learning” is used, in which the model is built on a sample that does not have an output parameter. The value of the output parameter (“refers to a cluster ...”, “looks like a vector ...”) is selected automatically in the learning process.

For description (dimensionality) reduction problems, it is typical that there is no separation into input and output vectors. Beginning with K. Pearson's classic work on principal component analysis, the focus is on data approximation.
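A minimal sketch of such data approximation, projecting observations onto their first principal component with NumPy (toy random data assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 observations, 3 features
X = X - X.mean(axis=0)                   # center the data

# singular value decomposition; the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 1                                    # keep only the first principal component
scores = X @ Vt[:k].T                    # reduced description of the observations
X_approx = scores @ Vt[:k]               # best rank-k approximation of the data

rel_error = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print(f"relative approximation error with {k} component(s): {rel_error:.2f}")
```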

Stages of learning

A typical series of stages for solving problems using Data Mining methods is distinguished:

  1. Hypothesis formulation;
  2. Data collection;
  3. Data preparation (filtering);
  4. Model selection;
  5. Selection of the model parameters and the learning algorithm;
  6. Model training (automatic search for the remaining parameters of the model);
  7. Analysis of the training quality; if it is unsatisfactory, return to step 5 or step 4;
  8. Analysis of the identified patterns; if it is unsatisfactory, return to step 1, 4 or 5.

Data preparation

Before using Data Mining algorithms, it is necessary to prepare the set of data to be analyzed. Since Data Mining can only detect patterns that are present in the data, the initial data, on the one hand, must be large enough for these patterns to be present in it and, on the other hand, compact enough for the analysis to take an acceptable time. Most often, data warehouses or data marts act as the source data. Preparation is required before clustering or mining multidimensional data.

The cleaned data is reduced to feature sets (or vectors, if the algorithm can only work with vectors of fixed dimension), one feature set per observation. The set of features is formed in accordance with hypotheses about which features of the raw data have high predictive power and with regard to the computing power required for processing. For example, a 100×100 pixel black-and-white image of a face contains 10,000 bits of raw data. It can be converted into a feature vector by detecting the eyes and mouth in the image. The result is a reduction of the data volume from 10,000 bits to a list of position codes, significantly reducing the amount of data analyzed and hence the analysis time.

A number of algorithms are able to handle missing data that itself has predictive power (for example, the absence of a certain type of purchase by a client). For instance, the association rule method processes not feature vectors but sets of variable dimension.

The choice of the objective function will depend on what is the purpose of the analysis; choosing the "right" function is fundamental to successful data mining.

Observations are divided into two categories - training set and test set. The training set is used to "train" the Data Mining algorithm, and the test set is used to test the patterns found.

See also

  • Reshetov's Probabilistic Neural Network


Literature

  • Paklin N. B., Oreshkov V. I. Business Intelligence: From Data to Knowledge (+ CD). - St. Petersburg. : Ed. Peter, 2009. - 624 p.
  • Duke V., Samoylenko A. Data Mining: training course (+CD). - St. Petersburg. : Ed. Peter, 2001. - 368 p.
  • Zhuravlev Yu.I. , Ryazanov V.V., Senko O.V. RECOGNITION. Mathematical methods. Software system. Practical Applications. - M .: Ed. "Phasis", 2006. - 176 p. - ISBN 5-7036-0108-8
  • Zinoviev A. Yu. Visualizing Multidimensional Data. - Krasnoyarsk: Ed. Krasnoyarsk State Technical University, 2000. - 180 p.
  • Chubukova I. A. Data Mining: A Tutorial. - M .: Internet University of Information Technologies: BINOM: Knowledge Laboratory, 2006. - 382 p. - ISBN 5-9556-0064-7
  • Ian H. Witten, Eibe Frank and Mark A. Hall Data Mining: Practical Machine Learning Tools and Techniques. - 3rd edition. - Morgan Kaufmann, 2011. - P. 664. - ISBN 9780123748560


Data Mining

Data Mining is a methodology and process of discovering, in the large volumes of data accumulated in companies' information systems, knowledge that is previously unknown, non-trivial, practically useful and available for interpretation, and that is needed for decision-making in various areas of human activity. Data Mining is one of the stages of the larger Knowledge Discovery in Databases methodology.

The knowledge discovered in the process of Data Mining must be non-trivial and previously unknown. Non-triviality suggests that such knowledge cannot be discovered by simple visual analysis. They should describe relationships between the properties of business objects, predict the values ​​of some features based on others, and so on. Found knowledge should be applicable to new objects.

The practical usefulness of knowledge is due to the possibility of their use in the process of supporting managerial decision-making and improving the company's activities.

Knowledge should be presented in a form that is understandable to users who do not have special mathematical training. For example, the logical constructions “if, then” are most easily perceived by a person. Moreover, such rules can be used in various DBMS as SQL queries. In the case when the extracted knowledge is not transparent to the user, there should be post-processing methods that allow them to be brought to an interpretable form.

Data mining is not one method but a combination of a large number of different knowledge discovery methods. All tasks solved by Data Mining methods can be conditionally divided into six types: classification, regression, clustering, association, sequential patterns, and deviation analysis (described above).

Data Mining is multidisciplinary in nature, as it includes elements of numerical methods, mathematical statistics and probability theory, information theory and mathematical logic, artificial intelligence and machine learning.

The tasks of business analysis are formulated in different ways, but the solution of most of them comes down to one or another Data Mining task or to a combination of them. For example, risk assessment is a solution to a regression or classification problem, market segmentation is clustering, demand stimulation is association rules. In fact, Data Mining tasks are elements from which you can "assemble" the solution to most real business problems.

To solve the above problems, various methods and algorithms of Data Mining are used. In view of the fact that Data Mining has developed and is developing at the intersection of such disciplines as mathematical statistics, information theory, machine learning and databases, it is quite natural that most Data Mining algorithms and methods were developed based on various methods from these disciplines. For example, the k-means clustering algorithm was borrowed from statistics.
