How to read a query plan in SQL Server and what to look out for. Query plans made easy: how to analyze the execution plan of a query.

The story is as old as the world. Two tables:

  • Cities - 100 unique cities.
  • People - 10 million people. For some people, the city may not be specified.

The distribution of people across cities is even. Indexes on the Cities.Id, Cities.Name, and People.CityId fields are in place.

You want to select the first 100 People records, sorted by city name.

Rolling up our sleeves, we cheerfully write:

Select top 100 p.Name, c.Name as City from People p
left join Cities c on c.Id = p.CityId
order by c.Name

This returns something like the expected result... in 6 seconds. (MS SQL 2008 R2, i5 / 4 GB)

But how can that be?! Why 6 seconds?! We know that the first 100 rows will all be from Almaty! After all, there are 10 million records, which means about 100 thousand per city. And even if that were not so, we could take the first city in the list and check whether it has at least 100 inhabitants.

Why, having the statistics, does SQL Server not do this:

Select * from People p
left join Cities c on c.Id = p.CityId
where p.CityId in (select top 1 id from Cities order by Name)
order by c.Name

This query returns approximately 100 thousand records in under a second! It confirms we have the required 100 records and hands them over very, very quickly.

However, MSSQL does everything according to plan. And it has a plan: "pure fusion" (c).

A question for the experts:
how should the SQL query be fixed, or what should be done on the server, so that the first query returns its result ten times faster?

P.S.
-- Schema reconstructed from the description above; the original listing was garbled, so exact names may differ.
CREATE TABLE Cities (
    Id uniqueidentifier NOT NULL,
    Name nvarchar(50) NOT NULL
) ON [PRIMARY]
GO

CREATE TABLE People (
    Id uniqueidentifier NOT NULL,
    CityId uniqueidentifier NULL,
    Name nvarchar(50) NOT NULL
) ON [PRIMARY]
GO
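The indexes mentioned above might be created like this (a sketch; the index names and uniqueness choices are my assumptions):

CREATE UNIQUE CLUSTERED INDEX PK_Cities ON Cities (Id)
CREATE NONCLUSTERED INDEX IX_Cities_Name ON Cities (Name)
CREATE UNIQUE CLUSTERED INDEX PK_People ON People (Id)
CREATE NONCLUSTERED INDEX IX_People_CityId ON People (CityId)
GO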

P.P.S.
Where does this problem come from?
The task is very real. There is a table holding the main entity, with many dimension tables hanging off it in a star pattern. The user needs it displayed in a grid, with sorting by any field.
Once the main table reaches a certain size, sorting degenerates into selecting a window of identical (extreme) values (like "Almaty"), and the system starts to slow down terribly.
I would like to have ONE parameterized query that works efficiently with both a small People table and a large one.

P.P.P.S.
Interestingly, if CityId were NOT NULL and an INNER JOIN were used, the query would execute instantly.
Also interesting: EVEN IF CityId were NOT NULL but a LEFT JOIN were used, the query would still be slow.

An idea from the comments: first select everything via INNER JOIN, then UNION in the NULL values. Tomorrow I will check this and other crazy ideas.)

P.P.P.P.S. I tried it. It worked!

WITH Help AS
(
    SELECT * FROM
    (
        SELECT TOP 100 p.Name, c.Name AS City
        FROM People p
        INNER JOIN Cities c ON c.Id = p.CityId
        ORDER BY c.Name ASC
    ) AS WithCity
    UNION
    SELECT TOP 100 p.Name, NULL AS City
    FROM People p
    WHERE p.CityId IS NULL
)
SELECT TOP 100 * FROM Help ORDER BY City

It runs in 150 milliseconds under the same conditions! Thanks!

6 responses

There are several ways to obtain an execution plan; which one to use depends on your circumstances. You can usually use SQL Server Management Studio to get the plan; however, if for some reason you can't run your query in SQL Server Management Studio, you may find it useful to capture the plan via SQL Server Profiler or by inspecting the plan cache.

Method 1 - Using SQL Server Management Studio

SQL Server Management Studio has some neat features that make collecting an execution plan easy: just make sure the "Include Actual Execution Plan" menu item (found in the "Query" menu) is ticked, then run your query as normal.

If you are trying to get the execution plan for statements in a stored procedure, you must execute the stored procedure, like so:

Exec p_Example 42

When your query completes, an additional "Execution Plan" tab appears in the results pane. If you ran many statements, you may see many plans displayed on this tab.

From here you can inspect the execution plan in SQL Server Management Studio, or right-click the plan and select "Save Execution Plan As..." to save it to an XML file.

Method 2 - Using the SHOWPLAN Options

This method is very similar to method 1 (in fact, it is what SQL Server Management Studio does internally); I include it for completeness, or in case you don't have SQL Server Management Studio available.

Before executing your query, run one of the following statements. The statement must be the only statement in the batch, i.e. you cannot execute another statement together with it:

SET SHOWPLAN_TEXT ON
SET SHOWPLAN_ALL ON
SET SHOWPLAN_XML ON
SET STATISTICS PROFILE ON
SET STATISTICS XML ON -- this is the recommended option to use

These are connection-level options, so you only need to run one once per connection. From then on, every statement you run is accompanied by an additional result set containing your execution plan in the chosen format; just run your query as usual to see the plan.
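For example, a typical session might look like this (a minimal sketch; substitute your own query):

SET STATISTICS XML ON
GO

SELECT TOP 10 name FROM sys.objects -- your query here
GO

SET STATISTICS XML OFF
GO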

Once you're done, you can disable the option with the corresponding OFF statement, for example:

SET STATISTICS XML OFF

Comparison of execution plan formats

Unless you have a strong preference, I recommend using the STATISTICS XML option. This option is equivalent to the "Include Actual Execution Plan" option in SQL Server Management Studio and provides the most information in the most convenient format.

  • SHOWPLAN_TEXT - Displays a basic text-based estimated execution plan, without executing the query
  • SHOWPLAN_ALL - Displays a text-based estimated execution plan with cost estimates, without executing the query
  • SHOWPLAN_XML - Displays an XML-based estimated execution plan with cost estimates, without executing the query. This is equivalent to the "Display Estimated Execution Plan..." option in SQL Server Management Studio.
  • STATISTICS PROFILE - Executes the query and displays a text-based actual execution plan.
  • STATISTICS XML - Executes the query and displays an XML-based actual execution plan. This is equivalent to the "Include Actual Execution Plan" option in SQL Server Management Studio.

Method 3 - Using the SQL Server Profiler

If you can't run your query directly (or your query doesn't run slowly when you execute it directly; remember, we want the plan of the query that performs badly), then you can capture the plan using a SQL Server Profiler trace. The idea is to run your query while a trace that captures one of the "Showplan" events is running.

Note that, depending on the load, you can use this method in a production environment; however, you should obviously be careful. SQL Server's profiling mechanisms are designed to minimize their impact on the database, but that doesn't mean there won't be a performance impact. You may also have trouble filtering out and identifying the correct plan in your trace if your database is under heavy use. And you should obviously check with your DBA that they are happy for you to do this on their precious database!

  • Open SQL Server Profiler and create a new trace connecting to the database against which you want to capture the trace.
  • On the "Events Selection" tab, tick "Show all events", check the "Performance" → "Showplan XML" row, and run the trace.
  • While the trace is running, do whatever you need to do to get the slow query to run.
  • Wait for the query to complete and stop the trace.
  • To save the plan, right-click the plan XML in SQL Server Profiler and select "Extract Event Data..." to save it to an XML file.

The plan you get is equivalent to the "Include Actual Execution Plan" option in SQL Server Management Studio.

Method 4 - Checking the Query Cache

If you can't run your query directly, and you can't capture a Profiler trace either, you can still obtain an estimated plan by inspecting the SQL Server query plan cache.

We inspect the plan cache by querying SQL Server DMVs. Below is a basic query that lists all cached query plans (as XML) together with their SQL text. On most databases you will also need to add extra filter conditions to narrow the results down to the plans you are interested in.

SELECT UseCounts, Cacheobjtype, Objtype, [TEXT], query_plan
FROM sys.dm_exec_cached_plans
CROSS APPLY sys.dm_exec_sql_text(plan_handle)
CROSS APPLY sys.dm_exec_query_plan(plan_handle)

Run this query and click the XML plan to open it in a new window; right-click and select "Save Execution Plan As..." to save the plan to an XML file.
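As a sketch of such filtering, the same query can be narrowed down to plans whose SQL text mentions a particular table (the 'People' name here is only a placeholder):

SELECT UseCounts, Cacheobjtype, Objtype, [TEXT], query_plan
FROM sys.dm_exec_cached_plans
CROSS APPLY sys.dm_exec_sql_text(plan_handle)
CROSS APPLY sys.dm_exec_query_plan(plan_handle)
WHERE [TEXT] LIKE N'%People%' -- placeholder filter; adjust to your case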

Notes:

Since so many factors are involved (from the table and index schema down to the stored data and table statistics), you should always try to get the execution plan from the database you are actually interested in (usually the one experiencing the performance problem).

You cannot capture an execution plan for encrypted stored procedures.

"actual" and "estimated" execution plans

The actual execution plan is where SQL Server actually executes the query, whereas the estimated execution plan of SQL Server works on what it could do without executing the query. Although logically equivalent, the actual execution plan is much more useful because it contains additional data and statistics about what actually happened when the query was executed. This is important when diagnosing problems when SQL server evaluations are disabled (for example, when statistics are out of date).

How do you interpret a query execution plan?

That is a topic worthy of a (free) book in its own right.

In addition to the comprehensive answer already posted, it is sometimes useful to access the execution plan programmatically to extract information. Sample code for this is below.

DECLARE @TraceID INT
EXEC StartCapture @@SPID, @TraceID OUTPUT
EXEC sp_help 'sys.objects' /* <-- Call your stored proc of interest here. */
EXEC StopCapture @TraceID

My favorite tool for obtaining and deeply analyzing query execution plans is SQL Sentry Plan Explorer. It is much more convenient and complete for detailed analysis and visualization of execution plans than SSMS.

Here is an example screenshot to give you an idea of the functionality the tool offers:

This is just one of the views available in the tool. Notice the set of tabs at the bottom of the application window, which allows you to get different types of execution plan views and useful additional information.

Also, I haven't noticed any restrictions in its free version that prevent daily use or eventually force you to buy the Pro version. So if you'd rather stick with the free version, you aren't barred from anything.

Apart from the methods described in the previous answers, you can also use the free execution plan viewer and query optimization tool ApexSQL Plan (which I've recently come across).

You can install ApexSQL Plan and integrate it into SQL Server Management Studio, so execution plans can be viewed directly from SSMS.

Viewing Estimated Execution Plans in ApexSQL Plan

  1. Click the New Query button in SSMS and paste the query text into the query window. Right-click and select "Display Estimated Execution Plan" from the context menu.
  2. The execution plan diagram is shown on the Execution Plan tab in the results section. Next, right-click the execution plan and select "Open in ApexSQL Plan" from the context menu.
  3. The estimated execution plan opens in ApexSQL Plan, where it can be analyzed for query optimization.

Viewing Actual Execution Plans in ApexSQL Plan

To view the actual execution plan of a query, repeat the second step above, but now, once the estimated plan appears, click the "Actual" button on the main ribbon bar in ApexSQL Plan.

Clicking the Actual button displays the actual execution plan, with a detailed preview of the cost parameters and other execution plan details.

More information about viewing execution plans can be found by following this link.

Query plans can also be retrieved from an Extended Events session via the query_post_execution_showplan event. An example XEvent session is shown below:

/* Generated via the "Query Detail Tracking" template. The predicates below were
   garbled in the original and have been restored to the template's standard form
   (database_id > 4 AND is_system = 0). */
CREATE EVENT SESSION [GetExecutionPlan] ON SERVER
ADD EVENT sqlserver.query_post_execution_showplan(
    ACTION(package0.event_sequence, sqlserver.plan_handle, sqlserver.query_hash, sqlserver.query_plan_hash, sqlserver.session_id, sqlserver.sql_text, sqlserver.tsql_frame, sqlserver.tsql_stack)),
/* Remove any of the following events (or include additional events) as desired. */
ADD EVENT sqlserver.error_reported(
    ACTION(package0.event_sequence, sqlserver.client_app_name, sqlserver.database_id, sqlserver.plan_handle, sqlserver.query_hash, sqlserver.query_plan_hash, sqlserver.session_id, sqlserver.sql_text, sqlserver.tsql_frame, sqlserver.tsql_stack)
    WHERE (package0.greater_than_uint64(sqlserver.database_id, (4)) AND package0.equal_boolean(sqlserver.is_system, (0)))),
ADD EVENT sqlserver.module_end(SET collect_statement=(1)
    ACTION(package0.event_sequence, sqlserver.client_app_name, sqlserver.database_id, sqlserver.plan_handle, sqlserver.query_hash, sqlserver.query_plan_hash, sqlserver.session_id, sqlserver.sql_text, sqlserver.tsql_frame, sqlserver.tsql_stack)
    WHERE (package0.greater_than_uint64(sqlserver.database_id, (4)) AND package0.equal_boolean(sqlserver.is_system, (0)))),
/* The event name here was lost in the original; rpc_completed is what this template normally includes. */
ADD EVENT sqlserver.rpc_completed(
    ACTION(package0.event_sequence, sqlserver.client_app_name, sqlserver.database_id, sqlserver.plan_handle, sqlserver.query_hash, sqlserver.query_plan_hash, sqlserver.session_id, sqlserver.sql_text, sqlserver.tsql_frame, sqlserver.tsql_stack)
    WHERE (package0.greater_than_uint64(sqlserver.database_id, (4)) AND package0.equal_boolean(sqlserver.is_system, (0)))),
ADD EVENT sqlserver.sp_statement_completed(SET collect_object_name=(1)
    ACTION(package0.event_sequence, sqlserver.client_app_name, sqlserver.database_id, sqlserver.plan_handle, sqlserver.query_hash, sqlserver.query_plan_hash, sqlserver.session_id, sqlserver.sql_text, sqlserver.tsql_frame, sqlserver.tsql_stack)
    WHERE (package0.greater_than_uint64(sqlserver.database_id, (4)) AND package0.equal_boolean(sqlserver.is_system, (0)))),
ADD EVENT sqlserver.sql_batch_completed(
    ACTION(package0.event_sequence, sqlserver.client_app_name, sqlserver.database_id, sqlserver.plan_handle, sqlserver.query_hash, sqlserver.query_plan_hash, sqlserver.session_id, sqlserver.sql_text, sqlserver.tsql_frame, sqlserver.tsql_stack)
    WHERE (package0.greater_than_uint64(sqlserver.database_id, (4)) AND package0.equal_boolean(sqlserver.is_system, (0)))),
ADD EVENT sqlserver.sql_statement_completed(
    ACTION(package0.event_sequence, sqlserver.client_app_name, sqlserver.database_id, sqlserver.plan_handle, sqlserver.query_hash, sqlserver.query_plan_hash, sqlserver.session_id, sqlserver.sql_text, sqlserver.tsql_frame, sqlserver.tsql_stack)
    WHERE (package0.greater_than_uint64(sqlserver.database_id, (4)) AND package0.equal_boolean(sqlserver.is_system, (0))))
ADD TARGET package0.ring_buffer
WITH (MAX_MEMORY=4096 KB, EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS, MAX_DISPATCH_LATENCY=30 SECONDS, MAX_EVENT_SIZE=0 KB, MEMORY_PARTITION_MODE=NONE, TRACK_CAUSALITY=ON, STARTUP_STATE=OFF)
GO

Once the session is created, in SSMS open Object Explorer and go to Management | Extended Events | Sessions. Right-click the "GetExecutionPlan" session and start it. Then right-click it again and select "Watch Live Data".

Then open a new query window and run one or more queries. Here's one for AdventureWorks:

USE AdventureWorks;
GO
SELECT p.Name AS ProductName,
       NonDiscountSales = (OrderQty * UnitPrice),
       Discounts = ((OrderQty * UnitPrice) * UnitPriceDiscount)
FROM Production.Product AS p
INNER JOIN Sales.SalesOrderDetail AS sod ON p.ProductID = sod.ProductID
ORDER BY ProductName DESC;
GO

After a minute or two, you should see some results in the "GetExecutionPlan: Live Data" tab. Select one of the query_post_execution_showplan events in the grid, and then click the Query Plan tab below the grid. It should look something like this:

EDIT: The XEvent code and screenshot were produced with SQL Server/SSMS 2012 SP2. If you are on SQL 2008/R2, you can tweak the script to run there, but those versions have no GUI for this, so you would need to extract the showplan XML, save it as a *.sqlplan file, and open it in SSMS. That is cumbersome. XEvents did not exist in SQL 2005 or earlier. So if you are not on SQL 2012 or later, I would strongly suggest one of the other answers posted here.


What follows are notes from a talk on the structured query language by Sergey Olontsev (SQL Server MCM, MVP, Kaspersky Lab) at msdevcon.ru (#msdevcon).

Sample query:

select pers.firstname, pers.lastname, emp.jobtitle, emp.nationalidnumber
from HumanResources.Employee as emp
inner join Person.Person as pers on pers.businessentityid = emp.businessentityid
where pers.firstname = N'John' and emp.hiredate >= '…'

Logical query tree:

Project: pers.firstname, pers.lastname, emp.jobtitle, emp.nationalidnumber
Filter: pers.firstname = N'John' and emp.hiredate >= '…'
Join: pers.businessentityid = emp.businessentityid
Get Data: Person.Person as pers
Get Data: HumanResources.Employee as emp

A query plan shows how a T-SQL query is executed at the physical level. There are multiple ways to obtain one.

DEMO: a simple plan; selecting all data from a table; how to get the query plan.

Operator methods:

  • Init() - causes the physical operator to initialize itself and prepare all the necessary data structures. A physical operator can receive many Init() calls, although typically it receives only one.
  • GetNext() - causes the physical operator to fetch the first or the next row of data. A physical operator may receive many GetNext() calls, or none at all. GetNext() returns one row of data, and the number of calls to it is shown as the ActualRows value in Showplan output.
  • Close() - causes the physical operator to perform cleanup and shut down. A physical operator receives only one Close() call.

Interaction between operators: Operator 1 asks Operator 2 for a row (Request Row); Operator 2 in turn asks Operator 3. Rows are then sent back up the chain (Send Row), one at a time, until the result is complete.

DEMO: the TOP operator, or why "iterator" is a better name for an operator.

Tables do not exist!

HoBT (heap or B-tree): rows are stored on pages (Page 1 holding Rows 1 and 2, Page 2 holding Rows 3 and 4, and so on), and the pages are linked to one another.

DEMO: data access operators; Scan, Seek, Lookup.

Who has only one table in their database?

Physical join operators: Nested Loops, Hash Join and Merge Join.

  • Nested Loops: inner join, left outer join, left semi join, left anti semi join.
  • Merge Join: inner join, left outer join, left semi join, left anti semi join, right outer join, right semi join, right anti semi join, union.
  • Hash Join: all types of logical join operations.

DEMO: Nested Loops, Merge Join, Hash Join, Sort, the First Operator.

Warnings.

DEMO: errors and warnings in query plans.

"I know that I know nothing." (Socrates)

DEMO: a small example of obscure behavior.

Query plan diagnostics:

-- TOP 10 queries that consume the most CPU, and their plans
select top(10)
    substring(t.text, qs.statement_start_offset / 2,
        case when qs.statement_end_offset = -1 then len(t.text)
             else (qs.statement_end_offset - qs.statement_start_offset) / 2
        end) as statement_text,
    qs.execution_count,
    cast(qs.total_worker_time / 1000. as decimal(18, 2)) as total_worker_time_ms,
    cast(qs.total_worker_time * 1. / qs.execution_count / 1000. as decimal(18, 2)) as avg_worker_time_ms,
    cast(p.query_plan as xml) as query_plan
from sys.dm_exec_query_stats as qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) as t
cross apply sys.dm_exec_text_query_plan(qs.plan_handle, qs.statement_start_offset, qs.statement_end_offset) as p
order by qs.total_worker_time desc;
go
-- (the /1000. divisors, lost in the original, convert microseconds to milliseconds)

Technique for reading large query plans: try to break the plan into logical blocks and analyze them gradually. In SSMS, when a plan is displayed graphically, a button appears in the lower right corner for easier navigation around the plan. You can also use XQuery/XPath.

DEMO: a large query plan.

DEMO: SQL Sentry Plan Explorer.

Summary. The First Operator: optimization level, compile time, size in cache, parameters and their compile-time values, reason for early termination. Iterator costs: look at the highest-cost operators first, and keep in mind that costs are only estimates (even in actual execution plans).

Summing up. Bookmark/Key Lookups: if there are few of them, there is most likely no problem; if there are many, creating a covering index will get rid of them. Warnings: check why each one occurs and take action if necessary.

Summing up. Connections between operators (data flows): the thicker the connection, the more rows passed between those operators; pay special attention if the data flow grows dramatically at some stage. Table join order: the smaller the data streams, the easier they are to join, so join first the tables whose resulting data streams are smallest.

Summing up. Scans: a scan does not in itself mean there is a problem; perhaps the table lacks an index that would allow a more efficient search, but if you need all or most of a table, a scan is actually more efficient. A seek does not mean all is well: a large number of lookups through a non-clustered index can be a problem. And anything you don't understand about a plan is potentially a problem.


Alexander Kuklin wrote an excellent article, "Plan Cache and Query Parameterization. Part 1: Plan Cache Analysis". I recommend it to everyone.

Here is a small excerpt from it:

The query processor, which is responsible for executing the SQL queries received by SQL Server and returning their results to the client, consists of two main components:

  1. Query Optimizer.
  2. Query Executor (Relational Engine).

Since a SELECT statement does not define the exact steps SQL Server must take to give the client the requested data, SQL Server must parse the statement itself and determine the most efficient way to extract that data. The statement is first handled by the query optimizer, where the following steps are performed by the optimizer's components:

  1. The Parser looks at the SELECT statement, breaks it down into logical units such as keywords, expressions, operators and identifiers, and normalizes the query.
  2. From the parser, the output goes to the Algebrizer component, which performs semantic analysis of the text. The Algebrizer checks that the database objects and fields referenced in the query exist, checks the correct use of operators and query expressions, and extracts literals from the query text to enable automatic parameterization.
    This, for example, is why a query whose SELECT list contains fields that appear neither in aggregate functions nor in the GROUP BY clause passes the Ctrl+F5 (parse) check in SQL Server Management Studio (SSMS), but fails with an error when actually run with F5 (it does not pass semantic analysis); a minimal example follows this list.
  3. Next, the Algebrizer builds a query parse tree describing the logical steps needed to transform the source data into the desired result. For this query tree, object metadata is fetched (data types, index statistics, etc.), implicit type conversions are added where necessary, and redundant operations (for example, unnecessary or duplicate table joins) are removed.
  4. The query optimizer then analyzes the different ways the source tables can be accessed and chooses the sequence of steps it believes will return the result fastest while using the fewest resources. The sequence of chosen steps is written into the query tree, and a query execution plan is generated from the final, optimized version of the tree.
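As a minimal illustration of the parse-versus-bind difference noted in step 2 (using the People table from the beginning of this article), the following parses fine with Ctrl+F5 but fails semantic analysis when executed:

SELECT CityId, Name -- Name is neither aggregated nor in the GROUP BY clause
FROM People
GROUP BY CityId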

The resulting query execution plan is then stored in the plan cache, and the query executor, following the sequence of instructions (steps) recorded in the plan, requests the required data from the storage subsystem, converts it into the format specified for the result set, and returns it to the client.

Probably every 1C developer has asked the question "which is faster, a join or a condition in WHERE?" or, say, "should I write a nested query or use the IN() operator?"

After that, the 1C developer goes to a forum, where he is told: you need to look at the query plan. He looks, understands nothing, and abandons forever the idea of optimizing queries through plans, going back to comparing query variants with simple performance measurements.

As a result, the query simply flies on the developer's machine, and then on the production database, as the number of records grows, everything dies and the complaints begin, in the style of "1C is slow". A familiar picture, isn't it?

In this article I will not give you exhaustive instructions for reading query plans. But I will try to explain intelligibly what they are and from which side to approach them.

Moreover, I do not consider myself a great query optimizer, so factual mistakes are quite likely in this article. Well, let the gurus correct me in the comments; that's why we're a community, to help each other, right?

If you already know how to read query plans, you should probably skip this article. It starts with the very simplest things, from the beginning, and is aimed at developers who have not yet figured out what kind of beast a query plan is.

How does a computer work

And I'll start from afar. The fact is that the computers we are used to are not that smart. You probably remember your first computer science lessons, or the early years at university: bubble-sorting arrays, or reading a file line by line? Well, nothing fundamentally new has been invented in modern relational DBMSs.

If in those labs you read lines from a file and then wrote them somewhere else, you already have a rough idea of how a modern DBMS works. Yes, of course, everything there is much (very much) more complicated, but: a loop is still a loop, reading from disk still has not become faster than reading from RAM, and O(N) algorithms are still slower than O(1) algorithms as N grows.

Imagine someone came to you, an ordinary 1C developer, and said: "Look, buddy, you need to write a database. Here is a file; write some lines into it. And then read them back." Let's pretend you can't refuse. How would you solve this problem?

You would solve it the same way the folks at Microsoft, Oracle, Postgres and 1C solve it: open the file from your programming language, read the lines from it, and print them to the screen. The world has not come up with algorithms fundamentally different from those I have just described.

Now imagine you have two files. One contains counterparties, and the other contains counterparty contracts. How would you implement an INNER JOIN operation? Head-on, without any optimizations?

Counterparties (columns: ID, ...)

Contracts (columns: Counterparty ID, Contract number)

For simplicity, let's skip the nuances of opening files and reading them into memory, and focus on the join operation itself. How would you do it? I would do it like this:

For Each CounterpartyRow From Counterparties Loop
    For Each ContractRow From Contracts Loop
        If ContractRow.CounterpartyID = CounterpartyRow.ID Then
            OutputJoinResult(CounterpartyRow, ContractRow);
        EndIf;
    EndLoop;
EndLoop;

In this example, the OutputJoinResult function simply displays all the columns of the rows passed to it; its code is not important here.

So we see two nested loops: an outer one over one table, and an inner one that looks up the key from the outer row by simple enumeration. And now, suddenly: if you open a query plan with a JOIN in any DBMS used with 1C, you will quite likely see a "Nested Loops" construct in it. Translating from the language of the probable adversary into ours, we get "nested loops". That is, in the query plan the DBMS explains to you that for this join it applied the algorithm described above, an algorithm any seventh-grader could write. And powerful, world-class production DBMSs use it quite calmly, because in some situations it is the best there is.
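You can see this very algorithm in SQL Server by forcing it with a join hint; a sketch on the hypothetical tables above (table and column names are assumptions):

SELECT c.Name, d.ContractNumber
FROM Counterparties AS c
INNER JOIN Contracts AS d ON d.CounterpartyId = c.Id
OPTION (LOOP JOIN) -- forces the nested-loops algorithm; the plan will show a Nested Loops operator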

And why did I start with the join, anyway? Suppose you just need to find a counterparty by name. There is your file with counterparties: write the algorithm. I would write it like this:

For Each CounterpartyRow From Counterparties Loop
    If CounterpartyRow.Name = "Ivanov" Then
        OutputResult(CounterpartyRow);
    EndIf;
EndLoop;

No, seriously, how else could you write it? Essentially, no other way. If it is not known in what order the records lie in the table, you will have to look through all of them, like it or not. In query plan language this is called a Scan: a full pass over the data, nothing more.
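The T-SQL equivalent of this brute-force search (again on the hypothetical Counterparties table) would be:

SELECT *
FROM Counterparties
WHERE Name = N'Ivanov'
-- With no index on Name, the plan shows a Table Scan (or a Clustered Index Scan).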

Indexes

But how can we speed up the search for data in a table? Really, rescanning everything every time is evil of some kind.

Recall the card catalog in a clinic or a library. How does a search by client name work there? The wooden cabinets hold neat cards under the letters A to Z, and patient "Pupkin" is in the cabinet with the "P" card. There is no need to look through all the other letters. If we sort the data in our table and know where (under which record numbers) the entries starting with "P" lie, we come significantly closer to the performance of the lady at the registration desk. And that is better than brute force, isn't it?

So, the word "Index" in this context means (again, translated from the language of a potential enemy) "Table of Contents". To quickly find a chapter in a book, you go to the table of contents, find the title of the chapter there, then look at the page number and go straight to that page.

When the database needs to find a record in a table, it goes to the table of contents, looks up the value, sees the number of the record it is stored under, and jumps straight to the corresponding area of the data file.

In code, it might look something like this:

Index = New Map;
// ... fill the index ...
RecordNumber = Index["Ivanov"];
OutputResult(CounterpartiesTable[RecordNumber]);

It is known that miracles do not happen, so the memory for the "Index" map, as well as lookups in the map itself, are not free operations. But they are much cheaper than a direct enumeration of all the data. Oh yes, and this map must be constantly kept up to date as the underlying data is added or changed.

Now let's think about how you would implement this index yourself. You could store the records in the data file already sorted. That would be fine, except that, first, you need to search by different fields at different times, and second, what if the user wants to insert a record starting with "M" into a table already filled from A to Z? And he will, I assure you.

Let's remember how writing to a file is generally done.

fseek(file, position);               // jump to the desired address
fwrite(file, dataArray, dataLength); // write dataLength bytes from dataArray

If the position address points somewhere into the middle of the file and there is already data at that spot, it gets overwritten with the new data. If you need to insert something into the middle of a file (or of an array in memory), you must explicitly "move" everything after position to free up space, and only then write the new data. As you understand, "moving" data again means loops and I/O operations; in other words, it is not fast. Nothing on a computer happens by itself; everything is done on command.

Let's get back to the index. The user wants to insert something into the middle. Like it or not, you must either move the data, or contrive to store it in "pages" linked to each other in a list: physically write at the end, or into free space, but logically into the middle of the table. And then update the record numbers in the table of contents, since they have moved and the index now points to the wrong places. You have probably heard that database indexes speed up searches but slow down inserts and deletes. Now you know why that is so.

And we still have not solved the problem of searching by different fields. We cannot store the data in the file in several orders at once: one user searches by name, another, say, by date, and both at the same time. How would you solve this? To my mind the solution is obvious: store the data separately and the tables of contents separately, sorted by the required fields. That is, the data sits in the database as is, but next to it we create a file in which the records are sorted by name. That is an index on the "Name" field. And next to it there is another similar file, sorted by the "Date" field. To save space, the indexes store not all the columns of the main table, but only those being sorted on (so we can search here quickly, find the record number, and instantly jump to it to read the rest of the data).

The people who write grown-up DBMSs have not come up with anything better either. Indexes in a database are arranged exactly like this: all the indexed data is kept sorted, alongside the table, as a separate entity. Essentially, an index is just another table, and it takes up space proportional to the size of the main table, which is logical. Yes, there are various refinements, such as balanced trees and the like, but they do not change the essence much.

By the way, if you write the data into the main table already ordered, you can do without a separately stored index and treat the data table itself as the index. Great, right? Such an index is called "clustered". Logically enough, the field by which the table rows are sorted should, if possible, grow monotonically. You remember the insert into the middle, right?
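In T-SQL, the two kinds of "table of contents" just described might look like this (a sketch; table and column names are assumptions):

-- The table itself stored sorted by Id: no separate structure to consult.
CREATE CLUSTERED INDEX CX_Counterparties_Id ON Counterparties (Id)
-- Separate sorted "tables of contents" by name and by date.
CREATE NONCLUSTERED INDEX IX_Counterparties_Name ON Counterparties (Name)
CREATE NONCLUSTERED INDEX IX_Counterparties_Date ON Counterparties (CreatedDate)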

Query execution planning

Imagine you have a table with five million records, and it has an index. You need to quickly find the record with the word "Hello". Now imagine you have the same table, but with three records, and you also need to find "Hello". Which search method should be chosen? Open the index file, run a binary search over it, find the record number, open the main table file, jump to the record by its number, and read it? Or run a loop from one to three, checking each record against the condition? A modern computer runs a loop from one to three terrifyingly fast.

To make this decision, the query planner has to understand what it is dealing with. It operates on what is called statistics: the number of records in the tables, the distribution of values in columns, selectivity, and so on and so forth. These are all hints to the planner about which way of collecting the data will be faster; not necessarily the fastest possible, but at least fast enough with some probability. And the planner has limited time to decide: we want our data quickly, not to sit waiting while it finishes planning.

I, for one, would not take on writing a planner without defending a dissertation first. How it works, and how it manages to do the job quite tolerably, I do not know; so let's confine ourselves to the DBMS documentation. From it, it follows that, based on the statistics, the planner builds several step-by-step options for executing the query and then selects the most suitable of them; for example, the first acceptable one it finds. A heuristic too, isn't it?

"What should I do first," the planner thinks, "walk all of table A selecting records by the condition and then join to table B with nested loops, or find all the matching records of table B through an index and only then run through table A?" Each step has a certain weight, or cost; the higher the cost, the more expensive the step. The query plan always records the cost of each step the DBMS engine took to assemble the query result.
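You can look at these per-step cost estimates without executing anything; a sketch, again on the hypothetical Counterparties table:

SET SHOWPLAN_ALL ON
GO
SELECT Name FROM Counterparties WHERE Name = N'Ivanov' -- any query of interest
GO
SET SHOWPLAN_ALL OFF
GO

In the rows returned, the TotalSubtreeCost column holds the planner's cost estimate for each step.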

How a plan operator works

Each step of a query plan is implemented as an object whose input is one set of records and whose output is another. Put into code, a query plan operator is an implementation of an abstract interface with a single method:

Interface IQueryOperator
{
    DataRow GetNextRow();
}

For those who do not follow what is written here, let me explain. Every query plan operator has a "give me the next row" method. The DBMS engine pulls the operator by this method and, with each pull, adds the resulting row to the query result. For example, the row-filtering operator takes the whole table as input and outputs only the rows that satisfy the condition. Its output can in turn be fed to, say, a TOP 100 operator, and then to an aggregation operator (SUM or COUNT), each of which likewise encapsulates all its processing inside and outputs rows with the result.

Schematically it looks like this:

AllData -> Filter(Name = "Petrov") -> Top(100) -> Aggregate(COUNT)
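The same pipeline, written as a T-SQL sketch on the hypothetical Counterparties table:

SELECT COUNT(*)             -- Aggregate(COUNT)
FROM (
    SELECT TOP (100) *      -- Top(100)
    FROM Counterparties     -- AllData
    WHERE Name = N'Petrov'  -- Filter(Name = "Petrov")
) AS FirstHundred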

When you open a query plan, you see boxes connected by arrows. The boxes are operators. The arrows are the directions of the data flows: the data runs along the arrows from one operator to the next, merging at the end into the query result.

Each operator has certain attributes: the number of rows processed, the cost, the number of I/O operations, cache usage, and so on and so forth. All of this lets us judge the efficiency of the query. A table scan that runs through a million records and outputs two is not a very good query plan; but maybe the planner found nothing better, because it had no index to look into. Or maybe the statistics lied, saying the table held three records when in fact a million had been written into it and the statistics were never refreshed. All of this is the subject of investigation for the engineer studying the query.

A query plan is a step-by-step debugger for the query. Step by step, you see exactly what algorithm (literally) the DBMS applied to produce the result. You have seen examples of the algorithms themselves; terribly complex stuff: loops and conditions, sometimes even nested loops, the horror. It is important to understand what happens inside each operator: what algorithm was applied to the set of records during execution, and how long it ran.

I plan to consider specific operators found in query plans and their internal structure in the next article. Thank you for reading to the end!
