
Course work

by discipline: Databases

Topic: OLAP Technology

Completed:

Chizhikov Alexander Alexandrovich

Introduction

1. Classification of OLAP products

2. OLAP client - OLAP server: pros and cons

3. The core of the OLAP system

3.1 Construction principles

Conclusion

List of sources used

Appendices

Introduction

It is difficult to find a person in the computer world who does not understand, at least on an intuitive level, what databases are and why they are needed. Unlike the traditional relational DBMS, the concept of OLAP is not nearly as widely known, although the cryptic term "OLAP cubes" has probably been heard by almost everyone. So what is OnLine Analytical Processing?

OLAP is not a single software product, not a programming language, and not even a specific technology. If we try to cover OLAP in all its manifestations, it is a set of concepts, principles, and requirements underlying software products that make it easier for analysts to access data. Hardly anyone will disagree with such a definition, but it is doubtful that it brings non-specialists one iota closer to understanding the subject. So it is better to approach OLAP from a different direction and first find out why analysts need their access to data facilitated in some special way.

The fact is that analysts are special consumers of corporate information. The analyst's task is to find patterns in large data sets, so an analyst will not pay attention to a single fact; he needs information about hundreds and thousands of events. Incidentally, one of the essential considerations that led to the emergence of OLAP is performance and efficiency. Let's imagine what happens when an analyst needs information and no OLAP tools are available in the enterprise. The analyst, either on his own (which is unlikely) or with the help of a programmer, writes an appropriate SQL query and receives the data of interest in the form of a report, or exports it to a spreadsheet. A great many problems arise here. First, the analyst is forced to do something other than his job (SQL programming) or to wait for programmers to do the task for him - all of which hurts productivity, raises the rate of heart attacks and strokes, and so on. Second, a single report or table is, as a rule, not enough for these giants of thought and fathers of Russian analysis, so the whole procedure has to be repeated again and again. Third, as we have already found out, analysts do not ask for trifles - they need everything at once. This means (although technology is advancing by leaps and bounds) that the enterprise's relational database server, which the analyst queries, may ponder deeply and at length, blocking the other transactions.

The concept of OLAP appeared precisely to solve such problems. OLAP cubes are, in essence, meta-reports. By slicing meta-reports (that is, cubes) along dimensions, the analyst obtains the "ordinary" two-dimensional reports he is interested in (these are not necessarily reports in the usual sense of the term - what matters is that the data structures serve the same functions). The advantages of cubes are obvious: the data needs to be requested from the relational DBMS only once, when the cube is built. Since analysts, as a rule, do not work with information that is being added to and changed on the fly, the generated cube stays relevant for quite a long time. Thanks to this, not only are load spikes on the relational DBMS server eliminated (there are no more queries returning thousands and millions of rows), but the speed of data access for the analyst himself increases dramatically. In addition, as already noted, performance is also improved by computing subtotals of hierarchies and other aggregated values at the time the cube is built.

Of course, you have to pay for such a gain in performance. It is sometimes said that the data structure simply "explodes": an OLAP cube can take up tens or even hundreds of times more space than the original data.

Now that we have a rough idea of how OLAP works and what it is for, it is still worth formalizing our knowledge somewhat and stating the OLAP criteria without simultaneous translation into ordinary human language. These criteria (12 in all) were formulated in 1993 by E.F. Codd, the creator of the relational DBMS concept and, concurrently, of OLAP. We will not consider them directly, since they were later reworked into the so-called FASMI test, which defines the requirements for OLAP products. FASMI is an acronym formed from the first letters of those requirements:

Fast. The system should respond to a user request in about five seconds on average, handling most requests within one second; the most complex requests should be processed within twenty seconds. Studies have shown that users begin to doubt that a request will succeed at all if it takes more than thirty seconds.

Analysis. The system must be able to handle any logical and statistical analysis typical of business applications and ensure that the results are saved in a form accessible to the end user. Analysis tools may include procedures for time-series analysis, cost allocation, currency conversion, modeling changes in organizational structures, and others.

Shared. The system should provide ample means for restricting access to data and for the simultaneous work of many users.

Multidimensional. The system should provide a conceptually multidimensional representation of data, including full support for multiple hierarchies.

Information. The capacity of different OLAP systems is characterized by the amount of input data they can process, and it varies widely: the most advanced OLAP solutions can handle at least a thousand times more data than the least capable ones. When choosing an OLAP tool, a number of factors should be considered, including data duplication, required RAM, disk space usage, performance, integration with information stores, and so on.

1. Classification of OLAP products

So, the essence of OLAP lies in the fact that the initial information for analysis is presented in the form of a multidimensional cube, and it is possible to arbitrarily manipulate it and obtain the necessary information sections - reports. At the same time, the end user sees the cube as a multidimensional dynamic table that automatically summarizes data (facts) in various sections (dimensions), and allows you to interactively manage calculations and the form of the report. These operations are provided by the OLAP engine (or OLAP computing engine).

To date, many products implementing OLAP technologies have been developed around the world. To make it easier to navigate among them, OLAP products are classified by the method of storing the data for analysis and by the location of the OLAP engine. Let's take a closer look at each category of OLAP products.

I'll start with a classification by the way data is stored. Let me remind you that multidimensional cubes are built on the basis of source and aggregate data. Both source and aggregate data for cubes can be stored in both relational and multidimensional databases. Therefore, three data storage methods are currently used: MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP), and HOLAP (Hybrid OLAP). Accordingly, OLAP products are divided into three similar categories according to the method of data storage:

1. In the case of MOLAP, the source and aggregate data are stored in a multidimensional database or in a multidimensional local cube.

2. In ROLAP products, source data is stored in relational databases or in flat local tables on a file server. Aggregate data can be placed in service tables in the same database. The transformation of data from a relational database into multidimensional cubes occurs at the request of an OLAP tool.

3. In the case of using the HOLAP architecture, the source data remains in the relational database, while the aggregates are placed in the multidimensional one. An OLAP cube is built at the request of an OLAP tool based on relational and multidimensional data.

The next classification is based on the location of the OLAP machine. On this basis, OLAP products are divided into OLAP servers and OLAP clients:

In server OLAP tools, calculations and storage of aggregate data are performed by a separate process - the server. The client application only receives the results of queries against multidimensional cubes that are stored on the server. Some OLAP servers only support relational databases, some only multidimensional databases. Many modern OLAP servers support all three data storage methods: MOLAP, ROLAP, and HOLAP.

The OLAP client is designed differently. The multidimensional cube is built and the OLAP calculations are performed in the memory of the client computer. OLAP clients are also divided into ROLAP and MOLAP, and some support both data access options.

Each of these approaches has its pros and cons. Contrary to popular belief about the advantages of server tools over client tools, in a number of cases, using an OLAP client for users can be more efficient and more profitable than using an OLAP server.

2. OLAP client - OLAP server: pros and cons

When building an information system, OLAP functionality can be implemented with either server or client OLAP tools. In practice, the choice is a compromise between performance and the cost of the software.

The amount of data is determined by a combination of the following characteristics: the number of records, the number of dimensions, the number of dimension elements, the length of dimensions, and the number of facts. It is known that an OLAP server can process larger amounts of data than an OLAP client with equal computer power. This is because the OLAP server stores a multidimensional database on hard drives that contains precomputed cubes.

When performing OLAP operations, client programs query this database in an SQL-like language and receive not the entire cube but only the fragments being displayed. The OLAP client, by contrast, must hold the entire cube in RAM while it is working; in the case of the ROLAP architecture, the whole data array used to compute the cube must first be loaded into memory. Moreover, as the number of dimensions, facts, or dimension members grows, the number of aggregates grows explosively. The amount of data an OLAP client can process is therefore directly dependent on the amount of RAM on the user's PC.
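For a rough sense of the scale (using the hypothetical three-dimensional cube discussed later, with 250 customers, 500 products and 365 dates): the space of potential cells is 250 x 500 x 365 = 45,625,000, and adding just one more dimension with 100 members multiplies that to over 4.5 billion potential cells - which is why every extra dimension or dimension member drives up the OLAP client's memory requirements so sharply.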

Note, however, that most OLAP clients rely on distributed computing. Therefore, the number of processed records that limits a client OLAP tool is understood not as the volume of primary data in the corporate database, but as the size of the aggregated sample taken from it. The OLAP client generates a query to the DBMS that describes the filtering conditions and the algorithm for pre-grouping the primary data; the server finds and groups the records and returns a compact selection for further OLAP calculations. The size of this sample can be tens or hundreds of times smaller than the volume of primary, non-aggregated records. Consequently, such an OLAP client's demands on PC resources are significantly reduced.

In addition, the number of dimensions is limited by the capabilities of human perception. It is known that the average person can simultaneously operate with 3-4, at most 8, dimensions; with a larger number of dimensions in the dynamic table, perceiving the information becomes significantly harder. This factor should be taken into account when estimating in advance the RAM that an OLAP client may need.

The length of the dimensions also affects the size of the address space the OLAP tool uses when calculating the OLAP cube: the longer the dimension values, the more resources are required to pre-sort the multidimensional array, and vice versa. Having only short dimension values in the source data is thus another argument in favor of the OLAP client.

Performance is determined by the two factors discussed above: the amount of data being processed and the power of the computers. As the number of dimensions, for example, increases, the performance of all OLAP tools decreases because of the significant growth in the number of aggregates, but the rate of that decrease differs. Let's demonstrate this dependence on a graph.

Diagram 1. Dependence of the performance of client and server OLAP tools on data growth

The performance characteristics of an OLAP server are less sensitive to data growth. This is explained by the different technologies with which the OLAP server and the OLAP client process user requests. For example, during a drill-down operation the OLAP server accesses the stored data and "pulls up" only the data of the requested branch, whereas the OLAP client computes the entire set of aggregates at load time. Up to a certain data volume, however, the performance of server and client tools is comparable. For OLAP clients that support distributed computing, this region of comparable performance can extend to data volumes covering the OLAP analysis needs of a huge number of users. This is confirmed by the results of internal testing of MS OLAP Server and the OLAP client "Contour Standard". The test was performed on an IBM PC with a Celeron 400 MHz processor and 256 MB of RAM, on a sample of 1 million unique (i.e., aggregated) records with 7 dimensions containing from 10 to 70 members. The cube loading time in both cases did not exceed 1 second, and various OLAP operations (drill up, drill down, move, filter, etc.) were performed in hundredths of a second.

When the sample size exceeds the amount of RAM, swapping with the disk begins and the performance of the OLAP client drops sharply. Only from this point on can we talk about the advantage of the OLAP server.

It should be remembered that the "breaking point" defines the boundary of a sharp increase in the cost of an OLAP solution. For the tasks of each specific user, this point is easily determined by performance tests of the OLAP client. Such tests can be obtained from the developer company.

In addition, the cost of a server-side OLAP solution increases as the number of users grows. The fact is that the OLAP server performs the calculations for all users on a single computer, so the greater the number of users, the more RAM and processing power are required. Thus, if the volumes of data to be processed lie in the region where server and client systems perform comparably, then, other things being equal, using an OLAP client will be more profitable.

Using an OLAP server in the "classic" ideology involves uploading data from relational DBMSs to a multidimensional database. The upload is performed for a certain period, so the OLAP server data does not reflect the state at the current moment. Only those OLAP servers that support the ROLAP mode of operation are deprived of this shortcoming.

Similarly, a number of OLAP clients allow for ROLAP and Desktop architectures with direct database access. This provides on-line analysis of initial data.

An OLAP server places minimal requirements on the power of client terminals. Objectively, the requirements of an OLAP client are higher, because it performs its calculations in the RAM of the user's PC. The state of a particular organization's hardware fleet is therefore a critical factor to consider when choosing an OLAP tool. But here, too, there are pros and cons. An OLAP server does not use the considerable computing power of modern personal computers: if an organization already has a fleet of modern PCs, it is inefficient to use them merely as display terminals while incurring additional costs for a central server.

If the power of users' computers "leaves much to be desired", the OLAP client will run slowly or not work at all. Buying one powerful server can be cheaper than upgrading all PCs.

Here it is useful to take into account trends in hardware development. Since the amount of data for analysis is practically constant, the steady growth in PC power will lead to the expansion of the capabilities of OLAP clients and the displacement of OLAP servers into the segment of very large databases.

When using an OLAP server, only display data is sent over the network to the client PC, while the OLAP client receives all of the primary sample data.

Therefore, where an OLAP client is used, network traffic will be higher.

On the other hand, when using an OLAP server, user operations such as drill down generate new queries to the multidimensional database and, therefore, new data transfers. OLAP operations performed by an OLAP client take place in RAM and, accordingly, do not cause new data flows on the network.

It should also be noted that modern network hardware provides a high level of throughput.

Therefore, in the overwhelming majority of cases, the analysis of a database of "medium" sizes using an OLAP client will not slow down the user's work.

The cost of an OLAP server is quite high. This should also include the cost of a dedicated computer and the fixed costs of administering a multidimensional database. In addition, the implementation and maintenance of an OLAP server requires sufficiently high qualifications from the staff.

The cost of an OLAP client is an order of magnitude lower than the cost of an OLAP server. Administration and additional technical equipment for the server is not required. There are no high requirements for the qualification of personnel when implementing an OLAP client. An OLAP client can be deployed much faster than an OLAP server.

Development of analytical applications using client OLAP tools is a fast process that does not require special training of the performer. A user who knows the physical implementation of the database can develop an analytical application on their own, without the involvement of an IT specialist. When using an OLAP server, you need to learn 2 different systems, sometimes from different vendors, to create cubes on the server, and to develop a client application. The OLAP client provides a single visual interface for describing cubes and customizing user interfaces for them.

Consider the process of creating an OLAP application using the client tool.

Diagram 2. Creating an OLAP application using a ROLAP client tool

The principle of operation of ROLAP clients is the preliminary description of a semantic layer behind which the physical structure of the source data is hidden. Data sources can be local tables or relational DBMSs; the list of supported sources is determined by the specific software product. After that, the user can independently manipulate objects he understands in terms of the subject area in order to create cubes and analytical interfaces.

The client of an OLAP server works differently: when creating cubes on an OLAP server, the user manipulates the physical descriptions of the database, while the user-level descriptions are created in the cube itself. The OLAP server client is configured only for the cube.

Let's explain how the ROLAP client works using the example of creating a dynamic sales report (see Diagram 2). Let the initial data for analysis be stored in two tables: Sales and Deal.

When creating a semantic layer, the data sources - the Sales and Deal tables - are described in terms understandable to the end user and turn into "Products" and "Deals". The "ID" field from the "Products" table is renamed to "Code", and the "Name" to "Product", and so on.

Then a "Sales" business object is created. A business object is a flat table on the basis of which a multidimensional cube is formed. When the business object is created, the "Products" and "Deals" tables are joined on the product "Code" field. Since the report does not need to display all the fields of these tables, the business object uses only the Product, Date, and Amount fields.

Next, an OLAP report is created based on the business object. The user selects a business object and drags its attributes into the columns or rows area of ​​the report table. In our example, based on the "Sales" business object, a report was created on sales of goods by months.
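For illustration only, here is a minimal sketch of the kind of pre-grouped flat query a ROLAP client might send to the relational DBMS to obtain such a business object. The table and field names follow the example above, while the Deal.ProductID and Deal.SaleDate column names and the BDE alias 'SalesDB' are assumptions introduced for the sketch:

    program BuildSalesBO;   // sketch of the query behind the "Sales" business object

    {$APPTYPE CONSOLE}

    uses
      SysUtils, DB, DBTables;

    var
      Q: TQuery;
    begin
      Q := TQuery.Create(nil);
      try
        Q.DatabaseName := 'SalesDB';   // hypothetical BDE alias of the source database
        Q.SQL.Text :=
          'SELECT s.Name AS Product, d.SaleDate, SUM(d.Amount) AS Amount ' +
          'FROM Deal d JOIN Sales s ON s.ID = d.ProductID ' +   // ProductID, SaleDate are assumed names
          'GROUP BY s.Name, d.SaleDate';
        Q.Open;   // the DBMS returns a compact, pre-aggregated sample
        while not Q.Eof do
        begin
          WriteLn(Q.FieldByName('Product').AsString, '  ',
                  Q.FieldByName('SaleDate').AsString, '  ',
                  Q.FieldByName('Amount').AsFloat:0:2);
          Q.Next;
        end;
      finally
        Q.Free;
      end;
    end.

The result of such a query is exactly the aggregated sample described earlier, which the client caches and turns into the cube.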

When working with an interactive report, the user can set filtering and grouping conditions with the same simple mouse movements. At this point the ROLAP client accesses the data in its cache, whereas the client of an OLAP server generates a new query to the multidimensional database. For example, by applying a product filter in the sales report, we can get a report on sales of only the products of interest.

All settings for an OLAP application can be stored in a dedicated metadata repository, in an application, or in a multidimensional database system repository. Implementation depends on the specific software product.

So, in what cases can the use of an OLAP client for users be more efficient and beneficial than using an OLAP server?

Using an OLAP server becomes economically justified when the amount of data is very large and unmanageable for an OLAP client; otherwise the use of the latter is more justified. In that case the OLAP client combines high performance with low cost.

Powerful analysts' PCs are another argument in favor of OLAP clients. When using an OLAP server, these capacities are not used. Other benefits of OLAP clients include:

The cost of implementing and maintaining an OLAP client is significantly lower than the cost of an OLAP server.

When an OLAP client with a built-in engine is used, the data is transferred over the network once; no new data streams are generated while OLAP operations are performed.

Setting up ROLAP clients is simplified by eliminating the intermediate link - creating a multidimensional database.

3. The core of the OLAP system

3.1 Construction principles


From what has already been said, it is clear that the OLAP mechanism is one of the most popular data analysis methods today. There are two main approaches to implementing it. The first is called Multidimensional OLAP (MOLAP) - an implementation of the mechanism using a multidimensional database on the server side; the second is Relational OLAP (ROLAP) - building cubes on the fly from SQL queries to a relational DBMS. Each of these approaches has its pros and cons, and their comparative analysis is beyond the scope of this work. Only the implementation of the core of a desktop ROLAP module will be described here.

Such a task arose after using a ROLAP system built on the Decision Cube components included in Borland Delphi. Unfortunately, this set of components showed poor performance on large amounts of data. The problem can be mitigated by cutting off as much data as possible before feeding it into cube construction, but this is not always enough.

On the Internet and in the press you can find a lot of information about OLAP systems, but almost nowhere is it explained how they work inside.

Scheme of work:

The general scheme of the desktop OLAP system can be represented as follows:

Scheme 3. Work of a desktop OLAP system

The algorithm of work is the following:

1. Obtaining data in the form of a flat table or the result of an SQL query.

2. Caching data and converting them to a multidimensional cube.

3. Displaying the constructed cube using a cross-tab or chart, etc. In general, an arbitrary number of mappings can be connected to one cube.

Consider how such a system can be arranged internally. Let's start from the side that you can see and feel, that is, from the mappings. Displays used in OLAP systems are most often of two types - cross tables and charts. Consider a crosstab, which is the primary and most common way to display a cube.

In the figure below, rows and columns containing aggregated results are shown in yellow, cells containing facts are marked in light gray, and cells containing dimension data are marked in dark gray.

Thus, the table can be divided into the following elements, with which we will work in the future:

When filling the matrix with facts, we must proceed as follows:

Based on the dimension values, determine the coordinates of the element being added in the matrix.

Determine the coordinates of the totals columns and rows affected by the element being added.

Add the element to the matrix and to the corresponding totals columns and rows.

At the same time, it should be noted that the resulting matrix will be very sparse, which is why organizing it as a two-dimensional array (the option lying on the surface) is not only irrational but, most likely, impossible: the matrix would be so large that no amount of RAM would be enough to store it. For example, if our cube contains sales information for one year and has only 3 dimensions - Customers (250), Products (500) and Date (365) - then we get a fact matrix with the following number of elements: 250 x 500 x 365 = 45,625,000. And this despite the fact that there may be only a few thousand filled elements in the matrix. Moreover, the greater the number of dimensions, the sparser the matrix will be.

Therefore, to work with this matrix, special mechanisms for handling sparse matrices are needed. Various options for organizing a sparse matrix are possible; they are fairly well described in the programming literature, for example in the first volume of Donald Knuth's classic The Art of Computer Programming.

Let us now consider how we can determine the coordinates of a fact, knowing the dimensions corresponding to it. To do this, let's take a closer look at the header structure:

In this case, you can easily find a way to determine the numbers of the corresponding cell and the totals in which it falls. Several approaches can be suggested here. One is to use the tree to find matching cells. This tree can be built by passing through the sample. In addition, an analytical recursive formula can be easily defined to calculate the required coordinate.
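As a sketch of the "analytical formula" idea (and only as a sketch: the totals rows and columns are ignored here and would need separate handling), if the i-th dimension has N_i members and a fact's 0-based coordinates are c_i, a dense row-major cell index can be computed like this:

    program CoordIndexSketch;

    {$APPTYPE CONSOLE}

    // Mixed-radix (row-major) cell index computed from dimension coordinates.
    // Sizes[i] is the number of members of dimension i, Coords[i] the 0-based
    // coordinate of the fact on that dimension.
    function CellIndex(const Coords, Sizes: array of Integer): Integer;
    var
      i: Integer;
    begin
      Result := 0;
      for i := Low(Coords) to High(Coords) do
        Result := Result * Sizes[i] + Coords[i];
    end;

    begin
      // Customer 10, product 3, day 200 in the 250 x 500 x 365 example above:
      WriteLn(CellIndex([10, 3, 200], [250, 500, 365]));  // (10*500 + 3)*365 + 200 = 1826295
    end.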

The data stored in the table needs to be converted before it can be used. To improve performance when building the hypercube, it is desirable to find the unique elements stored in the columns that serve as the cube's dimensions. In addition, facts can be pre-aggregated for records that have identical dimension values. As mentioned above, it is the unique values in the dimension fields that matter to us. The following structure can then be proposed for storing them:

Scheme 4. Structure for storing unique values

Using this structure significantly reduces the need for memory, which matters because, to increase processing speed, it is desirable to keep the data in RAM. Moreover, it is possible to keep only the array of elements in memory and unload their values to disk, since they are needed only when the crosstab is displayed.
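A minimal sketch of such a dimension dictionary, under the simplifying assumption that a sorted string list is used instead of the hash tree discussed later: the coordinate of a value is simply its index in the sorted list, which also gives the reverse coordinate-to-value lookup by direct addressing. The sample values are made up.

    program DimensionDictSketch;

    {$APPTYPE CONSOLE}

    uses
      Classes;

    var
      Dict: TStringList;

    // Returns the coordinate of Value, adding the value to the dictionary if it is new.
    // Note: inserting into a sorted list shifts the indices of later elements, so in
    // this sketch the coordinates are stable only after all values have been loaded.
    function CoordOf(const Value: string): Integer;
    begin
      if not Dict.Find(Value, Result) then
        Result := Dict.Add(Value);
    end;

    begin
      Dict := TStringList.Create;
      try
        Dict.Sorted := True;           // keeps the values in alphabetical order
        Dict.Duplicates := dupIgnore;  // each value is stored only once
        CoordOf('Milk');
        CoordOf('Bread');
        CoordOf('Milk');               // already present - nothing is added
        WriteLn('Coordinate of Bread: ', Dict.IndexOf('Bread'));  // value -> coordinate
        WriteLn('Value at coordinate 1: ', Dict[1]);              // coordinate -> value
      finally
        Dict.Free;
      end;
    end.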

The ideas described above were the basis for creating the CubeBase component library.

Scheme 5. The structure of the CubeBase component library

TCubeSource performs caching and data conversion into an internal format, as well as preliminary data aggregation. The TCubeEngine component calculates the hypercube and performs operations with it. In fact, it is an OLAP machine that converts a flat table into a multidimensional data set. The TCubeGrid component displays the crosstab and controls the display of the hypercube. TCubeChart allows you to see the hypercube in the form of graphs, and the TCubePivote component controls the operation of the cube core.

So, I have considered the architecture and interaction of components that can be used to build an OLAP machine. Now let's take a closer look at the internal structure of the components.

The first stage of the system's work is loading the data and converting it into an internal format. A logical question arises: why is this necessary, when you could simply use the data from the flat table, scanning it whenever a cube slice is built? To answer it, let's look at the table structure from the point of view of an OLAP machine. For an OLAP system, table columns are either facts or dimensions, and the logic of working with these columns differs. In a hypercube the dimensions are, in effect, the axes, and the dimension values are coordinates on those axes. The cube is filled very unevenly: there are combinations of coordinates that correspond to no records at all and combinations that correspond to several records in the source table, and the first situation is the more common one - that is, the cube resembles outer space, empty except for scattered clusters of points (facts). Thus, if during the initial data load we pre-aggregate the data - combine records that have the same dimension values while computing the preliminary aggregated fact values - then from that point on we work with fewer records, which increases the speed of work and reduces RAM requirements.
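A sketch of that pre-aggregation step under stated assumptions: the source records are assumed to be already reduced to integer coordinates (as described further below), records with identical coordinate tuples are merged by summing their fact value, and a sorted string key built from the coordinates stands in for a real hash table. All identifiers and sample numbers are illustrative.

    program PreAggregateSketch;

    {$APPTYPE CONSOLE}

    uses
      SysUtils, Classes;

    type
      TFact = record
        Coords: array of Integer;  // one coordinate per dimension
        Sum: Double;               // pre-aggregated fact value
      end;

    var
      Facts: array of TFact;
      Keys: TStringList;           // coordinate key -> position in Facts

    function KeyOf(const Coords: array of Integer): string;
    var
      i: Integer;
    begin
      Result := '';
      for i := Low(Coords) to High(Coords) do
        Result := Result + IntToStr(Coords[i]) + '|';
    end;

    // Merges one source record into the fact list, summing the values of
    // records that share the same dimension coordinates.
    procedure AddRecord(const Coords: array of Integer; Value: Double);
    var
      Key: string;
      i, n: Integer;
    begin
      Key := KeyOf(Coords);
      if Keys.Find(Key, i) then
      begin
        n := Integer(Keys.Objects[i]);
        Facts[n].Sum := Facts[n].Sum + Value;
      end
      else
      begin
        n := Length(Facts);
        SetLength(Facts, n + 1);
        SetLength(Facts[n].Coords, Length(Coords));
        for i := Low(Coords) to High(Coords) do
          Facts[n].Coords[i] := Coords[i];
        Facts[n].Sum := Value;
        Keys.AddObject(Key, TObject(n));  // integer stored in the Objects slot (sketch-level trick)
      end;
    end;

    begin
      Keys := TStringList.Create;
      try
        Keys.Sorted := True;
        AddRecord([1, 5, 17], 100.0);  // made-up coordinates: customer 1, product 5, day 17
        AddRecord([1, 5, 17], 250.0);  // same cell - merged rather than stored twice
        AddRecord([2, 5, 17], 40.0);
        WriteLn('Aggregated records: ', Length(Facts));     // 2
        WriteLn('First cell total:   ', Facts[0].Sum:0:2);  // 350.00
      finally
        Keys.Free;
      end;
    end.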

To build slices of the hypercube, we need the following capabilities: determining the coordinates (that is, the dimension values) for table records, and finding the records that have specific coordinates (dimension values). Let's see how these capabilities can be implemented. The easiest way to store the hypercube is to use a database in its own internal format.

Schematically, the transformations can be represented as follows:

Diagram 6. Conversion of the flat table into a normalized database in the internal format

That is, instead of one table we have obtained a normalized database. Database experts may say that normalization reduces system speed, and they would certainly be right if we needed to retrieve the values of dictionary elements (in our case, dimension values). But the point is that at the slice-construction stage we do not need these values at all. As mentioned above, we are only interested in the coordinates in our hypercube, so we will define coordinates for the dimension values. The simplest approach is to number the values. To make the numbering unambiguous within one dimension, we first sort the lists of dimension values (the dictionaries, in database terms) alphabetically. The facts are also renumbered and pre-aggregated. We get the following diagram:

Scheme 7. Renumbering the normalized database to determine the coordinates of dimension values

Now it only remains to link the elements of the different tables to each other. In relational database theory this is done using special intermediate tables. For us it is enough to associate each entry in a dimension table with a list whose elements are the numbers of the facts in whose formation that dimension value was used (in other words, to determine all facts that have the same value of the coordinate described by this entry). Each fact record, in turn, is associated with the values of the coordinates at which it is located in the hypercube. In what follows, by the coordinates of an entry in the hypercube we mean the numbers of the corresponding entries in the dimension-value tables. Then, for our hypothetical example, we get the following set, which defines the internal representation of the hypercube:

Scheme 8. Internal representation of a hypercube

This will be our internal representation of the hypercube. Since we are not building it in a relational database, we simply use variable-length fields to link the dimension values (we could not do this in an RDBMS, where the number of table columns is fixed in advance).

One could try to use a set of temporary tables to implement a hypercube, but this method will provide too low performance (an example is a set of Decision Cube components), so we will use our own data storage structures.

To implement a hypercube, we need to use data structures that will provide maximum performance and minimum memory consumption. Obviously, our main structures will be for storing dictionaries and fact tables. Consider the tasks that the dictionary should perform at maximum speed:

checking for the presence of an element in the dictionary;

adding an element to the dictionary;

search for record numbers that have a specific coordinate value;

search for coordinates by dimension value;

search for a dimension value by its coordinate.

Different data types and structures can be used to meet these requirements. For example, arrays of structures can be used; in a real system such arrays need additional indexing mechanisms that increase the speed of loading data and retrieving information.

To optimize the operation of the hypercube, we need to determine which tasks should be solved first and by what criteria we want to improve the quality of work. The main goal is to increase the speed of the program while keeping the required amount of RAM reasonably small. Performance can be improved by introducing additional data access mechanisms, such as indexing, but unfortunately this increases RAM overhead. We therefore determine which operations have to be performed fastest. To do this, consider the individual components that implement the hypercube. These components are of two main kinds: dimensions and the fact table. For a dimension, the typical tasks are:

adding a new value;

determining the coordinate from a dimension value;

determining the value from a coordinate.

When adding a new element value, we need to check whether we already have such a value and, if so, not add a new one but use the existing coordinate; otherwise we add the new element and determine its coordinate. This requires a way of quickly checking for the presence of the desired element (the same problem also arises when determining the coordinate from an element's value). Hashing is optimal for this purpose; the best structure is a hash tree in which we store links to the elements, the elements being the strings of the dimension dictionary. The structure of a dimension value can then be represented as follows:

    type
      PFactLink = ^TFactLink;

      TFactLink = record
        FactNo: integer;      // fact index in the fact table
        Next: PFactLink;      // next fact in the list (field implied by the linked list described here)
      end;

      TDimensionRecord = record
        Value: string;        // dimension value
        Index: integer;       // coordinate value
        FactLink: PFactLink;  // pointer to the beginning of the list of fact-table elements using this value
      end;

In the hash tree we store links to the unique elements. In addition, we need to solve the inverse problem: determining the dimension value from a coordinate. For best performance, direct addressing should be used; one option is another array whose index is the dimension coordinate and whose value is a link to the corresponding dictionary entry. However, it is simpler (and saves memory) to arrange the array of elements so that an element's index is its coordinate.

The organization of the array that implements the list of facts presents no particular problems, thanks to its simple structure. The only remark is that it is desirable to compute all the aggregation methods that may be needed and that can be calculated incrementally (for example, the sum).
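A small sketch of what "calculated incrementally" means for such a fact-list cell: the sum and count are maintained here (from which an average can later be derived), and min/max are handled the same way. The field names are illustrative.

    program IncrementalAggSketch;

    {$APPTYPE CONSOLE}

    type
      TAggCell = record
        Count: Integer;   // number of source records folded into this cell
        Sum: Double;      // running sum; Sum / Count gives the average on demand
        Min, Max: Double;
      end;

    // Folds one more fact value into the cell without revisiting earlier values.
    procedure UpdateCell(var Cell: TAggCell; Value: Double);
    begin
      if Cell.Count = 0 then
      begin
        Cell.Min := Value;
        Cell.Max := Value;
      end
      else
      begin
        if Value < Cell.Min then Cell.Min := Value;
        if Value > Cell.Max then Cell.Max := Value;
      end;
      Cell.Sum := Cell.Sum + Value;
      Inc(Cell.Count);
    end;

    var
      Cell: TAggCell;
    begin
      FillChar(Cell, SizeOf(Cell), 0);
      UpdateCell(Cell, 10);
      UpdateCell(Cell, 4);
      UpdateCell(Cell, 7);
      WriteLn('count=', Cell.Count, ' sum=', Cell.Sum:0:1,
              ' avg=', Cell.Sum / Cell.Count:0:2,
              ' min=', Cell.Min:0:1, ' max=', Cell.Max:0:1);
    end.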

So, we have described a way to store data in the form of a hypercube. It allows you to generate a set of points in a multidimensional space based on the information in the data warehouse. In order for a person to be able to work with these data, they must be presented in a form convenient for processing. At the same time, a pivot table and graphs are used as the main types of data presentation. Moreover, both of these methods are actually projections of the hypercube. In order to ensure maximum efficiency in constructing representations, we will start from what these projections represent. Let's start with the pivot table, as the most important for data analysis.

Let us find ways to implement such a structure. A pivot table consists of three parts: row headers, column headers, and the aggregated fact table itself. The simplest way to represent the fact table would be a two-dimensional array whose dimensions can be determined by building the headers. Unfortunately, the simplest way would also be the most inefficient: since the table is very sparse, memory would be used extremely wastefully, and as a result only very small cubes could be built - otherwise there might not be enough memory. Thus, we need a data structure that provides the maximum speed of searching for and adding a new element and, at the same time, minimum RAM consumption. Such a structure is the so-called sparse matrix, about which you can read in more detail in Knuth. There are various ways to organize the matrix; to choose the one that suits us, let us first consider the structure of the table headers.

Headers have a clear hierarchical structure, so it would be natural to use a tree to store them. In this case, the structure of the tree node can be schematically depicted as follows:

(see Appendix C)

As the dimension value, it is logical to store a reference to the corresponding element of the dimension table of the multidimensional cube; this reduces the memory consumed by the slice and speeds up the work. Links are also used for the parent and child nodes.

To add an element to the tree, we need information about its location in the hypercube. That information is its coordinates, which are stored in the dictionaries of dimension values. Let's consider the scheme for adding an element to the header tree of the pivot table. The values of the dimension coordinates serve as the initial information; the order in which these dimensions are listed is determined by the desired aggregation method and matches the hierarchy levels of the header tree. As a result, we need to obtain the list of columns or rows of the pivot table to which the element must be added.

(see Appendix D)

We use the dimension coordinates as the initial data for determining this structure. In addition, for definiteness, we will assume that we are determining the column of interest in the matrix (how the row is determined will be considered a little later, since other data structures are more convenient there; the reason for that choice is also given below). As coordinates we take integers - the numbers of the dimension values, which can be determined as described above.

So, after executing this procedure, we obtain an array of links to the columns of the sparse matrix. Now we need to perform the required actions with the rows: inside each column, we must find the desired element and add the corresponding value to it. For each of the dimensions involved, we need to know the number of unique values and the actual set of those values.
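Under the assumptions above, here is a sketch of such a header tree together with a procedure that walks a fact's column coordinates down it, creating the missing nodes on the way. The node layout loosely follows the description referenced in Appendix C (a dimension coordinate plus parent and child links); every node on the path corresponds to a column, with the ancestors playing the role of totals columns. All identifiers and sample coordinates are illustrative.

    program HeaderTreeSketch;

    {$APPTYPE CONSOLE}

    type
      PHeaderNode = ^THeaderNode;
      THeaderNode = record
        Coord: Integer;            // coordinate of the dimension value (index into its dictionary)
        Parent: PHeaderNode;
        FirstChild: PHeaderNode;
        NextSibling: PHeaderNode;
        ColumnNo: Integer;         // number of the pivot-table column headed by this node
      end;

    var
      NextColumnNo: Integer;

    function NewNode(Coord: Integer; Parent: PHeaderNode): PHeaderNode;
    begin
      New(Result);
      Result^.Coord := Coord;
      Result^.Parent := Parent;
      Result^.FirstChild := nil;
      Result^.NextSibling := nil;
      Result^.ColumnNo := -1;
    end;

    // Walks Coords level by level from Root, creating missing children, and returns
    // the leaf node; the leaf is the detail column, its ancestors the totals columns.
    function AddPath(Root: PHeaderNode; const Coords: array of Integer): PHeaderNode;
    var
      Level: Integer;
      Child: PHeaderNode;
    begin
      Result := Root;
      for Level := Low(Coords) to High(Coords) do
      begin
        Child := Result^.FirstChild;
        while (Child <> nil) and (Child^.Coord <> Coords[Level]) do
          Child := Child^.NextSibling;
        if Child = nil then
        begin
          Child := NewNode(Coords[Level], Result);
          Child^.NextSibling := Result^.FirstChild;  // prepend; a real implementation keeps order
          Result^.FirstChild := Child;
          Child^.ColumnNo := NextColumnNo;
          Inc(NextColumnNo);
        end;
        Result := Child;
      end;
    end;

    var
      Root, Leaf: PHeaderNode;
    begin
      NextColumnNo := 0;
      Root := NewNode(-1, nil);        // invisible root above the topmost header level
      Leaf := AddPath(Root, [2, 0]);   // made-up coordinates: product group 2, product 0
      WriteLn('Leaf column number: ', Leaf^.ColumnNo);
      AddPath(Root, [2, 1]);           // shares the "group 2" totals column with the first path
      WriteLn('Columns created so far: ', NextColumnNo);
    end.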

Now let's look at how to represent the values inside a column - that is, how to determine the required row. Several approaches can be used here. The simplest would be to represent each column as a vector, but since it would be very sparse, memory would be used extremely inefficiently. To avoid this, we use data structures that represent sparse one-dimensional arrays (vectors) more efficiently. The simplest of these would be an ordinary list, singly or doubly linked, but it is uneconomical in terms of element access time. We will therefore use a tree, which provides faster access to the elements.

For example, we could use exactly the same kind of tree as for the columns, but then we would have to create a separate tree for every column, which would lead to significant overhead in memory and processing time. Let's be a little more clever: we will keep a single tree that stores all the combinations of dimension values used in rows. This tree will be identical to the one described above, but its elements will not be pointers to rows (which do not exist as such) but their indices; the index values themselves are of no interest to us and are used only as unique keys. We will then use these keys to find the required element inside a column. The columns themselves are easiest to represent as ordinary binary trees. Graphically, the resulting structure can be represented as follows:

Diagram 9. Image of a pivot table in the form of a binary tree

To determine the corresponding row numbers, you can use the same procedure as the procedure described above for determining the columns of the pivot table. In this case, the row numbers are unique within one pivot table and identify elements in vectors that are columns of the pivot table. The simplest way to generate these numbers would be to maintain a counter and increment it by one when a new element is added to the row header tree. These column vectors themselves are most easily stored as binary trees, where the value of the row number is used as the key. In addition, the use of hash tables is also possible. Since the procedures for working with these trees are discussed in detail in other sources, we will not dwell on this and consider the general scheme for adding an element to a column.
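A sketch of one such column stored as a plain (unbalanced) binary tree keyed by row number, as described above; aggregation is reduced to a running sum, and the row numbers and values are made up.

    program ColumnTreeSketch;

    {$APPTYPE CONSOLE}

    type
      PCellNode = ^TCellNode;
      TCellNode = record
        RowNo: Integer;        // key: row number issued by the row-header tree
        Sum: Double;           // incrementally aggregated cell value
        Left, Right: PCellNode;
      end;

    // Finds the cell for RowNo in the column tree, creating it if necessary,
    // and folds Value into its aggregate.
    procedure AddToColumn(var Root: PCellNode; RowNo: Integer; Value: Double);
    begin
      if Root = nil then
      begin
        New(Root);
        Root^.RowNo := RowNo;
        Root^.Sum := Value;
        Root^.Left := nil;
        Root^.Right := nil;
      end
      else if RowNo < Root^.RowNo then
        AddToColumn(Root^.Left, RowNo, Value)
      else if RowNo > Root^.RowNo then
        AddToColumn(Root^.Right, RowNo, Value)
      else
        Root^.Sum := Root^.Sum + Value;  // same cell: just add the value in
    end;

    function CellValue(Root: PCellNode; RowNo: Integer): Double;
    begin
      if Root = nil then
        Result := 0                      // empty cell of the sparse column
      else if RowNo < Root^.RowNo then
        Result := CellValue(Root^.Left, RowNo)
      else if RowNo > Root^.RowNo then
        Result := CellValue(Root^.Right, RowNo)
      else
        Result := Root^.Sum;
    end;

    var
      Column: PCellNode;
    begin
      Column := nil;
      AddToColumn(Column, 7, 100.0);   // made-up row numbers and values
      AddToColumn(Column, 2, 50.0);
      AddToColumn(Column, 7, 25.0);
      WriteLn('Row 7: ', CellValue(Column, 7):0:2);  // 125.00
      WriteLn('Row 5: ', CellValue(Column, 5):0:2);  // 0.00 - the cell is not filled
    end.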

In a generalized form, the sequence of actions for adding an element to the matrix can be described as follows:

1. Determine the row numbers to which elements are added.

2. Determine the set of columns to which elements are added.

3. For all these columns, find the elements with the required row numbers and add the current element to them (adding includes accumulating the required fact values and computing the aggregated values that can be determined incrementally).

After executing this algorithm, we get a matrix, which is a pivot table that we needed to build.

Now a couple of words about filtering when building a slice. It is easiest to perform it right at the stage of constructing the matrix, since at that stage we have access to all the required fields and the aggregation of values is being carried out anyway. When a record is fetched from the cache, it is checked against the filtering conditions and, if it does not satisfy them, it is discarded.

Since the structure described above completely describes the pivot table, the task of its visualization will be trivial. In this case, you can use standard table components that are available in almost all Windows programming tools.

The first product to perform OLAP queries was Express (from the company IRI). However, the term OLAP itself was coined by Edgar Codd, "the father of relational databases". Codd's work was funded by Arbor, the company that had launched its own OLAP product, Essbase (later bought by Hyperion, which was acquired by Oracle in 2007), a year earlier. Other well-known OLAP products include Microsoft Analysis Services (formerly called OLAP Services, part of SQL Server), Oracle OLAP Option, IBM's DB2 OLAP Server (in fact, Essbase with IBM add-ons), SAP BW, and products from Brio, BusinessObjects, Cognos, MicroStrategy, and others.

From a technical point of view, the products on the market are divided into "physical OLAP" and "virtual". In the first case, there is a program that performs a preliminary calculation of aggregates, which are then stored in a special multidimensional database that provides fast retrieval. Examples of such products are Microsoft Analysis Services, Oracle OLAP Option, Oracle/Hyperion EssBase, Cognos PowerPlay. In the second case, data is stored in relational DBMSs, while aggregates may not exist at all or be created on the first request in the DBMS or the analytical software cache. Examples of such products are SAP BW, BusinessObjects, Microstrategy. Systems based on "physical OLAP" provide consistently better query response times than "virtual OLAP" systems. Virtual OLAP vendors claim that their products are more scalable in terms of supporting very large amounts of data.

In this work, I would like to take a closer look at the product of BaseGroup Labs - Deductor.

Deductor is an analytical platform, i.e. the basis for creating complete application solutions. The technologies implemented in Deductor make it possible, on the basis of a single architecture, to go through all the stages of building an analytical system: from creating a data warehouse to automatically selecting models and visualizing the results.

System composition:

Deductor Studio is the analytical core of the Deductor platform. Deductor Studio includes a complete set of mechanisms that allows you to get information from an arbitrary data source, carry out the entire processing cycle (cleaning, data transformation, building models), display the results in the most convenient way (OLAP, tables, charts, decision trees ...) and export results.

Deductor Viewer is the end user's workplace. The program minimizes the requirements placed on personnel, because all required operations are performed automatically by previously prepared processing scripts; there is no need to think about how the data is obtained or which mechanisms process it. The Deductor Viewer user only needs to select the report of interest.

Deductor Warehouse is a multidimensional cross-platform data warehouse that accumulates all the information necessary for the analysis of the subject area. The use of a single repository allows for convenient access, high processing speed, consistency of information, centralized storage and automatic support for the entire data analysis process.

4. Client-server

Deductor Server is designed for remote analytical processing. It provides the ability to both automatically "run" data through existing scripts on the server, and retrain existing models. Using the Deductor Server allows you to implement a full-fledged three-tier architecture in which it acts as an application server. Access to the server is provided by the Deductor Client.

Work principles:

1. Data import

Analysis of any information in Deductor starts with data import. As a result of the import, the data is brought to a form suitable for subsequent analysis using all the mechanisms available in the program. The nature of the data, its format, the DBMS, and so on do not matter, because the mechanisms for working with all of them are unified.

2. Data export

The presence of export mechanisms allows you to send the results to third party applications, for example, transfer a sales forecast to the system for generating a purchase order or post a prepared report on a corporate website.

3. Data processing

Processing in Deductor means any action associated with some kind of data transformation, for example, filtering, building a model, cleaning, and so on. Actually, in this block, the most important actions from the point of view of analysis are performed. The most significant feature of the processing mechanisms implemented in Deductor is that the data obtained as a result of processing can be processed again by any of the methods available to the system. Thus, it is possible to build arbitrarily complex processing scenarios.

4. Visualization

You can visualize data in Deductor Studio (Viewer) at any stage of processing. The system independently determines how it can do this, for example, if a neural network is trained, then in addition to tables and diagrams, you can view the graph of the neural network. The user needs to select the desired option from the list and configure several parameters.

5. Integration mechanisms

Deductor does not provide data entry tools - the platform is focused exclusively on analytical processing. To use information stored in heterogeneous systems, flexible import-export mechanisms are provided. Interaction can be organized using batch execution, working in OLE server mode and accessing the Deductor Server.

6. Replication of knowledge

Deductor allows you to implement one of the most important functions of any analytical system - support for the knowledge replication process, i.e. providing an opportunity for employees who are not versed in the methods of analysis and ways to obtain a particular result, to receive an answer based on models prepared by an expert.

Conclusion

In this paper, data analysis systems, an area of modern information technology, were considered. The main tool for analytical processing of information - OLAP technology - was analyzed. The essence of the OLAP concept and the importance of OLAP systems in the modern business process were discussed in detail, and the structure and operation of the core of a ROLAP system were described. The Deductor analytical platform was presented as an example of an implementation of OLAP technologies. The presented documentation has been developed in accordance with the requirements.

OLAP technologies are a powerful real-time data processing tool. An OLAP server allows you to organize and present data across different analytics and turns data into valuable information that helps companies make better decisions.

The use of OLAP systems provides consistently high levels of performance and scalability, supporting gigabyte-sized data volumes that can be accessed by thousands of users. With the help of OLAP technologies, access to information is carried out in real time, i.e. query processing no longer slows down the analysis process, ensuring its efficiency and effectiveness. Visual administration tools allow you to develop and implement even the most complex analytical applications, making this process simple and fast.


In the series of articles "Introduction to databases" published recently (see ComputerPress No. 3'2000 - 3'2001), we discussed various technologies and software tools used to create information systems: desktop and server DBMSs, data design tools, application development tools, as well as Business Intelligence - enterprise-wide data analysis and processing tools that are becoming more and more popular throughout the world, including in our country. Note, however, that the use of Business Intelligence tools and the technology for creating applications of this class are still insufficiently covered in the domestic literature. In this new series of articles, we will try to fill this gap and describe the technologies underlying such applications. As implementation examples, we will mainly use Microsoft OLAP technologies (chiefly Analysis Services in Microsoft SQL Server 2000), but we hope that the bulk of the material will be useful to users of other tools as well.

The first article in this series is devoted to the basics of OLAP (On-Line Analytical Processing) - a technology for multidimensional data analysis. In it, we'll cover the concepts of data warehousing and OLAP, the requirements for data warehousing and OLAP tools, the logical organization of OLAP data, and the basic terms and concepts used when discussing multidimensional analysis.

What is a data warehouse

Enterprise-scale information systems, as a rule, contain applications designed for complex multidimensional analysis of data, its dynamics, trends, etc. Such analysis is ultimately intended to assist decision making. Often these systems are called decision support systems.

It is impossible to make any managerial decision without the necessary, usually quantitative, information. This requires data warehousing, that is, the process of collecting, filtering, and pre-processing data in order to provide the resulting information to users for statistical analysis (and often for building analytical reports).

Ralph Kimball, one of the originators of the data warehouse concept, described the data warehouse as "a place where people can access their data" (see, for example, Ralph Kimball, "The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses", John Wiley & Sons, 1996 and "The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse", John Wiley & Sons, 2000). He also formulated the basic requirements for data warehouses:

  • support for high-speed data retrieval from storage;
  • support for internal data consistency;
  • the ability to obtain and compare the so-called data slices (slice and dice);
  • availability of convenient utilities for viewing data in the storage;
  • completeness and reliability of stored data;
  • support for a quality data replenishment process.

It is often impossible to satisfy all of the listed requirements within a single product. Therefore, several products are usually used to implement data warehouses: some serve as the actual data store, others as tools for extracting and viewing the data, still others as tools for replenishing it, and so on.

A typical data warehouse usually differs from a typical relational database. First, conventional databases are designed to help users do their day-to-day work, whereas data warehouses are designed for decision making. For example, selling a product and issuing an invoice are done using a database designed for transaction processing, while analyzing sales dynamics over several years, which makes it possible to plan work with suppliers, is done using a data warehouse.

Secondly, regular databases are subject to constant changes in the course of work of users, and the data warehouse is relatively stable: the data in it is usually updated according to a schedule (for example, weekly, daily, or hourly, depending on the needs). Ideally, the replenishment process is simply adding new data over a period of time without changing the old information already in storage.

And thirdly, conventional databases are most often the source of data that enters the warehouse. In addition, the repository can be replenished from external sources, such as statistical reports.

What is OLAP

Decision support systems usually have the means to provide the user with aggregate data for various samples from the initial set in a form convenient for perception and analysis. Typically, such aggregate functions form a multidimensional (and therefore non-relational) dataset (often referred to as a hypercube or metacube) whose axes contain the parameters and whose cells contain the aggregate data that depends on them. Along each axis, the data can be organized into a hierarchy representing different levels of detail. Thanks to this data model, users can formulate complex queries, generate reports, and receive subsets of data.

The technology of complex multidimensional data analysis is called OLAP (On-Line Analytical Processing). OLAP is a key component of data warehousing. The concept of OLAP was described in 1993 by Edgar Codd, a renowned database researcher and author of the relational data model (see E.F. Codd, S.B. Codd, and C.T. Salley, Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. Technical report, 1993). In 1995, based on the requirements set out by Codd, the so-called FASMI test (Fast Analysis of Shared Multidimensional Information) was formulated, which includes the following requirements for multidimensional analysis applications:

  • providing the user with the analysis results in a reasonable time (usually no more than 5 s), even at the cost of a less detailed analysis;
  • the ability to carry out any logical and statistical analysis specific to this application and save it in a form accessible to the end user;
  • multi-user access to data with support for appropriate locking mechanisms and authorized access tools;
  • multidimensional conceptual representation of data, including full support for hierarchies and multiple hierarchies (this is a key requirement of OLAP);
  • the ability to access any necessary information, regardless of its volume and storage location.

It should be noted that OLAP functionality can be implemented in various ways, from the simplest data analysis tools in office applications to distributed analytical systems based on server products. But before talking about the various implementations of this functionality, let's look at what OLAP cubes are from a logical point of view.

Multidimensional cubes

In this section, we will take a closer look at the concept of OLAP and multidimensional cubes. As an example of a relational database used to illustrate OLAP principles, we will use the Northwind database included with Microsoft SQL Server and Microsoft Access - a typical database storing information about the trading operations of a wholesale food supplier. Such data includes information about suppliers, customers, delivery companies, the list of supplied goods and their categories, data on orders and ordered goods, and a list of company employees. A detailed description of the Northwind database can be found in the Microsoft SQL Server or Microsoft Access help systems; we do not reproduce it here for lack of space.

To explore the concept of OLAP, let's use the Invoices view and the Products and Categories tables from the Northwind database, creating a query that will return the details of all ordered goods and issued invoices:

SELECT dbo.Invoices.Country, dbo.Invoices.City, dbo.Invoices.CustomerName, dbo.Invoices.Salesperson,
       dbo.Invoices.OrderDate, dbo.Categories.CategoryName, dbo.Invoices.ProductName,
       dbo.Invoices.ShipperName, dbo.Invoices.ExtendedPrice
FROM dbo.Products
     INNER JOIN dbo.Categories ON dbo.Products.CategoryID = dbo.Categories.CategoryID
     INNER JOIN dbo.Invoices ON dbo.Products.ProductID = dbo.Invoices.ProductID

In Access 2000, a similar query looks like this:

SELECT Invoices.Country, Invoices.City, Invoices.Customers.CompanyName AS CustomerName, Invoices.Salesperson,
       Invoices.OrderDate, Categories.CategoryName, Invoices.ProductName,
       Invoices.Shippers.CompanyName AS ShipperName, Invoices.ExtendedPrice
FROM Categories
     INNER JOIN (Invoices INNER JOIN Products ON Invoices.ProductID = Products.ProductID)
     ON Categories.CategoryID = Products.CategoryID;

This query accesses the Invoices view, which contains information about all invoices issued, as well as the Categories and Products tables, which contain information about the categories of the ordered products and the products themselves, respectively. As a result of this query, we will receive a set of order data, including the category and name of the ordered item, the order date, the name of the person who issued the invoice, the city, country, and name of the ordering company, as well as the name of the company responsible for delivery.

For convenience, let's save this query as a view, naming it Invoices1. The result of querying this view is shown in Fig. 1.

What aggregate data can we get based on this view? Usually these are answers to questions like:

  • What is the total value of orders placed by customers in France?
  • What is the total value of orders placed by customers in France and delivered by Speedy Express?
  • What is the total value of orders placed by customers in France in 1997 and delivered by Speedy Express?

Let's translate these questions into SQL queries (Table 1).
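A sketch of what such queries might look like against the Invoices1 view (the exact formulations are those given in Table 1):

SELECT SUM(ExtendedPrice) FROM Invoices1 WHERE Country = 'France'

SELECT SUM(ExtendedPrice) FROM Invoices1 WHERE Country = 'France' AND ShipperName = 'Speedy Express'

SELECT SUM(ExtendedPrice) FROM Invoices1 WHERE Country = 'France' AND ShipperName = 'Speedy Express' AND YEAR(OrderDate) = 1997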

The result of any of the above queries is a number. If you replace the 'France' parameter in the first query with 'Austria' or with the name of another country, you can run this query again and get a different number. After doing this procedure with all countries, we will get the following data set (the snippet is shown below):

Country SUM (ExtendedPrice)
Argentina 7327.3
Austria 110788.4
Belgium 28491.65
Brazil 97407.74
Canada 46190.1
Denmark 28392.32
Finland 15296.35
France 69185.48
Germany 209373.6

The resulting set of aggregate values ​​(in this case, sums) can be interpreted as a one-dimensional data set. The same data set can also be obtained as a result of a query with a GROUP BY clause of the following form:

SELECT Country, SUM (ExtendedPrice) FROM invoices1 GROUP BY Country

Now let's look at the second query above, which contains two conditions in the WHERE clause. If we execute this query, substituting all possible values ​​of the Country and ShipperName parameters into it, we will get a two-dimensional dataset of the following form (the snippet is shown below):

ShipperName
Country Federal Shipping Speedy Express United Package
Argentina 1 210.30 1 816.20 5 092.60
Austria 40 870.77 41 004.13 46 128.93
Belgium 11 393.30 4 717.56 17 713.99
Brazil 16 514.56 35 398.14 55 013.08
Canada 19 598.78 5 440.42 25 157.08
Denmark 18 295.30 6 573.97 7 791.74
Finland 4 889.84 5 966.21 7 954.00
France 28 737.23 21 140.18 31 480.90
Germany 53 474.88 94 847.12 81 962.58

Such a data set is called a pivot table or a cross table (crosstab). Many spreadsheets and desktop DBMSs allow you to create such tables - from Paradox for DOS to Microsoft Excel 2000. For example, a similar query looks like this in Microsoft Access 2000:

TRANSFORM Sum(Invoices1.ExtendedPrice) AS SumOfExtendedPrice
SELECT Invoices1.Country
FROM Invoices1
GROUP BY Invoices1.Country
PIVOT Invoices1.ShipperName;

Aggregate data for such a pivot table can also be obtained using the usual GROUP BY query:

SELECT Country, ShipperName, SUM(ExtendedPrice) FROM Invoices1 GROUP BY Country, ShipperName

Note, however, that the result of this query will not be the pivot table itself, but only the set of aggregate data needed to build it (a fragment is shown below):

Country ShipperName SUM (ExtendedPrice)
Argentina Federal Shipping 845.5
Austria Federal Shipping 35696.78
Belgium Federal Shipping 8747.3
Brazil Federal Shipping 13998.26

The third of the above queries already has three parameters in the WHERE clause. By varying them, we will get a three-dimensional data set (Fig. 2).

The cells of the cube shown in Fig. 2 contain aggregate data corresponding to the values of the WHERE-clause query parameters plotted on the cube axes.

You can get a set of two-dimensional tables by cutting a cube with planes parallel to its faces (they are denoted by the terms cross-sections and slices).
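For example, a slice of the three-dimensional cube at a fixed year could be obtained with a query of roughly the following form against the Invoices1 view (a sketch):

SELECT Country, ShipperName, SUM(ExtendedPrice)
FROM Invoices1
WHERE YEAR(OrderDate) = 1997
GROUP BY Country, ShipperName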

Obviously, the data contained in the cells of the cube can also be obtained using a corresponding query with a GROUP BY clause. In addition, some spreadsheets (in particular, Microsoft Excel 2000) also allow you to build a three-dimensional data set and view various sections of the cube, parallel to its faces, on a worksheet of a workbook.

If the WHERE clause contains four or more parameters, the resulting set of values ​​(also called an OLAP cube) can be 4-dimensional, 5-dimensional, and so on.
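For instance, adding the product category as a fourth parameter yields a four-dimensional aggregate data set; a sketch of such a query against the Invoices1 view:

SELECT Country, ShipperName, YEAR(OrderDate) AS OrderYear, CategoryName, SUM(ExtendedPrice)
FROM Invoices1
GROUP BY Country, ShipperName, YEAR(OrderDate), CategoryName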

Having considered what multidimensional OLAP cubes are, let's move on to some key terms and concepts used in multidimensional data analysis.

Some terms and concepts

Along with sums, the cells of an OLAP cube may contain the results of other aggregate functions of the SQL language, such as MIN, MAX, AVG, COUNT, and in some cases others (variance, standard deviation, etc.). The term summary is used for the data values in the cells (in general, there may be several of them in one cube), the term measure is used for the source data from which they are calculated, and the term dimension is used for the query parameters (the Russian-language literature renders this term differently depending on whether OLAP cubes or data warehouses are being discussed). The values plotted on the axes are called members of the dimensions.
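In terms of the Invoices1 example used above, this terminology can be illustrated with the following sketch:

-- ExtendedPrice is the measure, SUM(ExtendedPrice) is the summary placed in the cube cells,
-- Country and ShipperName are dimensions, and 'France' or 'Speedy Express' are members of those dimensions.
SELECT Country, ShipperName, SUM(ExtendedPrice)
FROM Invoices1
GROUP BY Country, ShipperName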

Speaking of dimensions, it should be mentioned that the values plotted on the axes can have different levels of detail. For example, we may be interested in the total value of orders placed by customers in different countries, or in the total value of orders placed by customers in individual cities or even by individual customers. Naturally, the resulting set of aggregate data in the second and third cases will be more detailed than in the first. Note that the ability to obtain aggregate data with varying degrees of detail corresponds to one of the requirements for data warehouses: the availability of different data slices for comparison and analysis.
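These levels of detail correspond to progressively finer groupings; a sketch against the Invoices1 view:

SELECT Country, SUM(ExtendedPrice) FROM Invoices1 GROUP BY Country

SELECT Country, City, SUM(ExtendedPrice) FROM Invoices1 GROUP BY Country, City

SELECT Country, City, CustomerName, SUM(ExtendedPrice) FROM Invoices1 GROUP BY Country, City, CustomerName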

Since in the considered example, in the general case, each country can have several cities, and a city can have several customers, we can talk about hierarchies of values ​​in dimensions. In this case, countries are at the first level of the hierarchy, cities are at the second, and customers are at the third (Fig. 3).

Note that hierarchies can be balanced, such as the hierarchy shown in Fig. 3 and hierarchies based on date and time data, or unbalanced. A typical example of an unbalanced hierarchy is a "boss-subordinate" hierarchy (it can be built, for example, from the values of the Salesperson field of the original data set in the example above); it is shown in Fig. 4.

Sometimes the term Parent-child hierarchy is used for such hierarchies.
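Such a hierarchy is usually stored as a self-referencing table; a minimal sketch, assuming a hypothetical Employees table in which the ReportsTo column holds the boss's EmployeeID:

SELECT e.LastName AS Employee, b.LastName AS Boss
FROM Employees AS e
     LEFT JOIN Employees AS b ON e.ReportsTo = b.EmployeeID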

There are also hierarchies that occupy an intermediate position between balanced and unbalanced ones (they are denoted by the term ragged, "uneven"). They usually contain members whose logical "parents" are not at the immediately higher level (for example, a geographic hierarchy has the levels Country, State, and City, but the data set contains countries that have no states or regions between the Country and City levels; Fig. 5).

Note that unbalanced and ragged hierarchies are not supported by all OLAP tools. For example, Microsoft Analysis Services 2000 supports both types, while Microsoft OLAP Services 7.0 supports only balanced ones. OLAP tools also differ in the number of hierarchy levels, the maximum allowable number of members at one level, and the maximum possible number of dimensions.

Conclusion

In this article, we got acquainted with the basics of OLAP. We learned the following:

  • The purpose of data warehouses is to provide users with information for statistical analysis and management decision-making.
  • Data warehouses should provide high speed of data acquisition, the ability to obtain and compare the so-called data slices, as well as the consistency, completeness and reliability of data.
  • OLAP (On-Line Analytical Processing) is a key component of building and using data warehouses. This technology is based on the construction of multidimensional data sets - OLAP cubes, the axes of which contain parameters, and the cells contain aggregate data that depends on them.
  • Applications with OLAP functionality must provide the user with analysis results in a reasonable time, perform logical and statistical analysis, support multi-user access to data, implement a multidimensional conceptual representation of data, and be able to access any necessary information.

In addition, we examined the basic principles of the logical organization of OLAP cubes, and also learned the basic terms and concepts used in multidimensional analysis. Finally, we learned what the different types of hierarchies are in OLAP cube dimensions.

In the next article of this series, we will look at the typical structure of data warehouses, talk about what constitutes client and server OLAP, and also dwell on some technical aspects of multidimensional data storage.

ComputerPress 4'2001

The purpose of this course work is to study OLAP technologies, the concept of their implementation, and their structure.

In the modern world, computer networks and computing systems make it possible to analyze and process large amounts of data.

A large amount of information greatly complicates the search for solutions, but also makes much more accurate calculations and analysis possible. To solve this problem, there is a whole class of information systems that perform analysis. Such systems are called decision support systems (DSS, Decision Support System).

To perform analysis, a DSS must accumulate information and have means for entering and storing it. In total, there are three main tasks solved in a DSS:

· data input;

· data storage;

· data analysis.

Data entry into the DSS is carried out either automatically from sensors characterizing the state of the environment or process, or by a human operator.

If data is entered automatically from sensors, it is accumulated either on a ready signal that occurs when new information appears or by cyclic polling. If the input is done by a person, the system should provide users with convenient data entry tools that check the input for correctness and perform the necessary calculations.

When data is entered simultaneously by several operators, the problems of concurrent access to and modification of the same data must be solved.

DSS provides analysts with data in the form of reports, tables, graphs for study and analysis, which is why such systems provide decision support functions.

Operational data processing is implemented in data entry subsystems, called OLTP (On-Line Transaction Processing) subsystems. Conventional database management systems (DBMS) are used to implement them.

The analysis subsystem can be built on the basis of:

· information retrieval analysis subsystems based on relational DBMSs and static SQL queries;

· operational analysis subsystems, implemented using the technology of on-line analytical data processing (OLAP), which relies on the concept of multidimensional data representation;

· intelligent analysis subsystems, which implement Data Mining methods and algorithms.

From the user's point of view, OLAP systems provide means for flexibly viewing information in various slices, automatically obtaining aggregated data, and performing analytical operations such as roll-up, drill-down, and comparison over time. Thanks to all this, OLAP systems are a solution with great advantages for preparing data for all types of business reporting that involve presenting data in various slices and at different levels of hierarchy, such as sales reports, various budget forms, and others. OLAP systems also have great advantages in other forms of data analysis, including forecasting.

1.2 Definition of OLAP systems

The technology of complex multidimensional data analysis is called OLAP. OLAP is a key component of a data warehouse organization.

OLAP functionality can be implemented in various ways, from the simplest, such as data analysis in office applications, to more complex, distributed analytical systems based on server products.

OLAP (On-Line Analytical Processing) is a technology for on-line analytical data processing that uses tools and methods for collecting, storing, and analyzing multidimensional data in order to support decision-making processes.

The main purpose of OLAP systems is to support analytical activities and arbitrary requests from analyst users. The purpose of OLAP analysis is to test emerging hypotheses.

In 1993, Edgar Codd, the founder of the relational approach to building databases (a mathematician and IBM Fellow), together with partners published an article initiated by Arbor Software (today the well-known Hyperion Solutions), entitled "Providing OLAP (On-Line Analytical Processing) to User-Analysts", which formulated 12 features of OLAP technology; these were subsequently supplemented by six more. These provisions became the main content of a new and very promising technology.

Main features of OLAP technology (Basic):

  • multidimensional conceptual representation of data;
  • intuitive data manipulation;
  • availability and detail of data;
  • batch extraction vs. interpretation;
  • OLAP analysis models;
  • client-server architecture (OLAP is accessible from the desktop);
  • transparency (transparent access to external data);
  • multi-user support.

Special features (Special):

  • processing of non-normalized data;
  • saving OLAP results: keeping them separate from the original data;
  • exclusion of missing values;
  • handling of missing values.

Reporting features (Report):

  • flexibility in reporting;
  • standard reporting performance;
  • automatic adjustment of the physical data extraction layer.

Dimension management (Dimension):

  • universality of dimensions;
  • unlimited number of dimensions and aggregation levels;
  • unlimited number of operations between dimensions.

Historically, today the term "OLAP" implies not only a multidimensional view of data on the end user's side, but also a multidimensional representation of data in the target database. This is what gave rise to "Relational OLAP" (ROLAP) and "Multidimensional OLAP" (MOLAP) as independent terms.

An OLAP service is a tool for analyzing large amounts of data in real time. Interacting with the OLAP system, the user is able to flexibly view information, obtain arbitrary data slices, and perform analytical operations of drill-down, roll-up, end-to-end distribution, and comparison over time across many parameters simultaneously. All work with the OLAP system takes place in terms of the subject area and allows building statistically sound models of the business situation.

OLAP software is a tool for on-line analysis of the data contained in a warehouse. Its main feature is that these tools are intended for use not by an information technology specialist or an expert statistician, but by a professional in the applied field of management: a manager of a department or division and, finally, a director. The tools are designed for the analyst to communicate with the problem, not with the computer. Fig. 6.14 shows an elementary OLAP cube that allows data to be evaluated in three dimensions.

A multidimensional OLAP cube and a system of corresponding mathematical algorithms for statistical processing make it possible to analyze data of any complexity over any time intervals.


Fig. 6.14.

Having at his disposal flexible mechanisms for data manipulation and visual display (Fig. 6.15, Fig. 6.16), the manager first examines the data from different angles, which may (or may not) be related to the problem being solved.

Next, he compares various business indicators with each other, trying to identify hidden relationships; he may take a closer look at the data, making it more granular, for example by breaking it down by time, region, or customer, or, conversely, further generalize the presentation of information to remove distracting details. After that, using the statistical estimation and simulation module, several scenarios for the development of events are built, and the most acceptable option is selected from them.


Fig. 6.15.

A company manager, for example, may come up with the hypothesis that the spread of asset growth across the company's branches depends on the ratio of specialists with technical and economic education in them. To test this hypothesis, the manager can request from the warehouse and display on a chart the ratio of interest to him for those branches whose asset growth for the current quarter decreased by more than 10% compared to last year, and for those whose assets increased by more than 25%. He should be able to do this with a simple selection from a menu. If the results obtained noticeably fall into two corresponding groups, this should become a stimulus for further verification of the hypothesis.

At present there is a direction called dynamic simulation (Dynamic Simulation), which fully implements the FASMI principle described above.

Using dynamic modeling, the analyst builds a model of a business situation that develops over time according to some scenario. The result of such modeling can be several new business situations that generate a tree of possible solutions with an assessment of the likelihood and prospects of each.


Fig. 6.16.

Table 6.3 shows the comparative characteristics of static and dynamic analysis.

Table 6.3.
Characteristic | Static analysis | Dynamic analysis
Question types | Who? What? How much? How? When? Where? | Why is that? What would happen if...? What happens if…?
Response time | Not regulated | Seconds
Typical data operations | Regulated report, diagram, table, figure | Sequence of interactive reports, diagrams, screen forms; dynamic change of aggregation levels and data slices
Level of analytical requirements | Medium | High
Display type | Mostly predetermined, regulated | User-defined, with customization options
Data aggregation level | Detailed and summary | User-defined
"Age" of the data | Historical and current | Historical, current and forecast
Query types | Mostly predictable | Unpredictable, from time to time
Purpose | Scheduled analytical processing | Multi-pass analysis, modeling and forecasting

Almost always, the task of building an analytical system for multidimensional data analysis is the task of building a unified, coherently functioning information system based on heterogeneous software tools and solutions. The very choice of tools for implementing such a system becomes an extremely difficult task. Many factors must be taken into account, including the mutual compatibility of various software components, the ease of their development, use, and integration, the efficiency of their operation, stability, and even the form, level, and potential prospects of relationships between the various vendors.

OLAP is applicable wherever there is a task of analyzing multifactorial data. In general, if there is a table of data that has at least one descriptive column and one column of numbers, an OLAP tool will be an effective means of analyzing it and generating reports. As an example of the use of OLAP technology, consider a study of the results of the sales process.

Key questions "How much was sold?", "How much was sold?" expand as the business becomes more complex and historical data accumulates to a certain set of factors, or cuts: ".. in St. Petersburg, in Moscow, in the Urals, in Siberia ...", ".. in the last quarter, compared to the current one", " ..from supplier A versus supplier B…" etc.

Answers to such questions are necessary for making management decisions: changing the assortment, prices, closing and opening stores, branches, terminating and signing contracts with dealers, conducting or terminating advertising campaigns, etc.

If you try to identify the main figures (facts) and slices (dimensions) that an analyst manipulates when trying to expand or optimize the company's business, you get a table suitable for sales analysis as a kind of template that requires appropriate adjustment for each specific enterprise.

Time. As a rule, these are several periods: Year, Quarter, Month, Decade, Week, Day. Many OLAP tools automatically calculate higher periods from a date and calculate totals for them.

Product category. There can be several categories, they differ for each type of business: Grade, Model, Type of packaging, etc. If only one product is sold or the assortment is very small, then the category is not needed.

Product. Sometimes the name of the product (or service), its code or article number is used. In cases where the assortment is very large (some enterprises have tens of thousands of items in their price lists), the initial analysis may be carried out not for all types of goods but generalized to some agreed-upon categories.

Region. Depending on the global nature of the business, you can mean Continent, Group of countries, Country, Territory, City, District, Street, Part of the street. Of course, if there is only one outlet, then this dimension is missing.

Salesman. This dimension also depends on the structure and scale of the business. It can be: Branch, Store, Dealer, Sales Manager. In some cases there is no such dimension, for example, when the seller does not influence sales volumes, there is only one store, and so on.

Buyer. In some cases, such as retail, the buyer is impersonal and there is no such dimension; in other cases information about the buyer exists and is important for sales. This dimension may contain the name of the buying company or many groupings and characteristics of customers: industry, group of enterprises, owner, and so on.

Analysis of the sales structure is used to identify the most important components in the slice of interest. For this it is convenient to use, for example, a "Pie" chart, or, in complex cases where three dimensions are examined at once, a "Columns" chart. For example, in the "Computer Technology" store, quarterly sales of computers amounted to $100,000, photographic equipment to $10,000, and consumables to $4,500. Conclusion: the store's turnover depends to a large extent on the sale of computers (in fact, consumables may be needed precisely for the sale of computers, but that is an analysis of internal dependencies).
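With the figures above, the total quarterly turnover is 100,000 + 10,000 + 4,500 = 114,500 USD, so computers account for roughly 87% of turnover, photographic equipment for about 9%, and consumables for about 4%; it is precisely this kind of share breakdown that the "Pie" chart visualizes.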

Dynamic analysis (regression analysis - identifying trends). Identification of trends and seasonal fluctuations. Visually, the dynamics are displayed with a "Line" chart. For example, sales of Intel products fell during the year while sales of Microsoft products grew. Perhaps the well-being of the average customer has improved, or the image of the store has changed and, with it, the composition of its customers; the assortment needs to be adjusted. Another example: for three years in a row, sales of video cameras fall in winter.
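A minimal sketch of the kind of query behind such a "Line" chart, assuming a hypothetical Sales table with SaleDate, ProductGroup, and Amount columns:

SELECT YEAR(SaleDate) AS SaleYear, MONTH(SaleDate) AS SaleMonth, SUM(Amount) AS Turnover
FROM Sales
WHERE ProductGroup = 'Video cameras'
GROUP BY YEAR(SaleDate), MONTH(SaleDate)
ORDER BY SaleYear, SaleMonth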

Dependency analysis (correlation analysis). Comparison of sales volumes of different goods over time to identify the required assortment, the "basket". It is also convenient to use a "Line" chart for this. For example, after printers were removed from the assortment, a drop in sales of toner cartridges was observed during the first two months.
