SAS, NAS, SAN: a step toward storage networks. Data storage systems

What is the purpose of data storage systems?

Data storage systems are designed for safe and fault-tolerant storage of processed data with the ability to quickly restore access to data in the event of a system failure.

What are the main types of storage systems?

By type of implementation, storage systems are divided into hardware and software. By field of application, they range from individual systems through small-workgroup, workgroup, and enterprise systems to corporate-wide ones. By type of connection, storage systems are divided into:

1. DAS (Direct Attached Storage - direct attached systems)

The defining feature of this type of system is that access to the data by devices on the network is controlled by the server or workstation to which the storage is attached.

2. NAS (Network Attached Storage - systems connected to a LAN)

In this type of system, access to information located in the storage is controlled by software running in the storage itself.

3. SAN (Storage Area Network - systems forming a network between the servers that process data and the storage systems themselves)

With this method of building a data storage system, access to information is controlled by software running on the storage servers. Via SAN switches, the storage is connected to servers over high-performance access protocols (Fibre Channel, iSCSI, ATA over Ethernet, etc.).

What are the features of software and hardware implementation of storage systems?

The hardware implementation of the storage system is a single hardware complex consisting of a storage device (which is a disk or an array of disks on which data is physically stored) and a control device (a controller that distributes data between storage elements).

The software implementation of a storage system is a distributed system in which data is stored without being tied to any particular storage device or server, and is accessed through specialized software that is responsible for the safety and security of the stored data.

In the simplest case, a SAN consists of storage systems, switches, and servers connected by optical communication channels. In addition to disk storage systems proper, a SAN can include disk libraries, tape libraries (streamers), devices for storing data on optical disks (CD/DVD and others), and so on.

An example of a highly available infrastructure in which servers are connected simultaneously to a local area network (left) and a storage area network (right). Such a scheme preserves access to data on the storage system even if any processor module, switch, or access path fails.

Using a SAN makes it possible to provide:

  • centralized resource management of servers and data storage systems;
  • connection of new disk arrays and servers without stopping the operation of the entire storage system;
  • use of previously purchased equipment together with new storage devices;
  • fast and reliable access to data drives located at a great distance from the servers, without significant performance loss;
  • faster backup and data recovery (BURA - backup, recovery, and archiving).

History

The development of networking technologies led to two networking solutions for storage: Storage Area Networks (SAN) for block-level data exchange, on top of which client file systems operate, and Network Attached Storage (NAS) servers for file-level storage. To distinguish traditional storage from networked storage, a retronym was proposed: Direct Attached Storage (DAS).

DAS, SAN and NAS, appearing on the market in succession, reflect the evolving chain of relationships between the applications that use data and the bytes on the media containing that data. Once, application programs themselves read and wrote blocks; then drivers appeared as part of the operating system. In modern DAS, SAN and NAS, the chain consists of three links: the first is the assembly of RAID arrays, the second is the processing of the metadata that lets binary data be interpreted as files and records, and the third is the services that present data to the application. The three approaches differ in where and how these links are implemented. With DAS, the storage is "bare": it only provides the ability to store and access data, and everything else, starting with interfaces and drivers, is done on the server side. With SAN, RAID provision moves to the storage side; everything else remains as in DAS. NAS differs in that metadata handling also moves to the storage system to provide file access; the client needs only the data services.
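To make the block/file distinction concrete, here is a minimal Python sketch that imitates both access styles against an ordinary file standing in for a disk; the 512-byte block size and the file name are illustrative assumptions, not part of any standard.

```python
BLOCK_SIZE = 512  # illustrative; real devices commonly use 512- or 4096-byte sectors

# Block-level access (the SAN view): the client addresses raw blocks by
# number and must interpret the bytes itself.
def read_block(device_path, block_no):
    with open(device_path, "rb") as dev:
        dev.seek(block_no * BLOCK_SIZE)
        return dev.read(BLOCK_SIZE)

# File-level access (the NAS view): the metadata lives on the storage side,
# so the client simply names a file.
def read_file(path):
    with open(path, "rb") as f:
        return f.read()

if __name__ == "__main__":
    with open("volume.img", "wb") as f:               # stand-in for a block device
        f.write(b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE)
    print(read_block("volume.img", 1)[:4])             # b'BBBB': raw block number 1
    print(read_file("volume.img")[:4])                 # b'AAAA': the file, by name
```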

The emergence of SAN became possible after the Fiber Channel (FC) protocol was developed in 1988 and approved by ANSI as a standard in 1994. The term Storage Area Network dates back to 1999. Over time, FC gave way to Ethernet, and IP-SAN networks with iSCSI connection became widespread.

The idea of a networked storage server, the NAS, belongs to Brian Randell of Newcastle University and was implemented in machines on UNIX servers in 1983. The idea proved so successful that it was picked up by a variety of companies, including Novell, IBM, and Sun, but leadership eventually passed to NetApp and EMC.

In 1995, Garth Gibson developed the principles of NAS further and created object storage (Object Storage, OBS). He began by dividing all disk operations into two groups: the more frequent ones, such as reads and writes, and the less frequent ones, such as naming operations. He then proposed, in addition to blocks and files, another container, which he called an object.

OBS is distinguished by a new, object-based interface: the client's data services interact with metadata via the object API. In addition to storing data, OBS supports RAID, stores object-related metadata, and exposes the object interface. DAS, SAN, NAS, and OBS coexist in time, but each type of access is better suited to particular kinds of data and applications.

SAN architecture

Network topology

SAN is a high-speed data network designed to connect servers to storage devices. The variety of SAN topologies (point-to-point, Arbitrated Loop, and switched fabric) replaces the traditional server-to-storage bus connection and provides greater flexibility, performance, and reliability. The SAN concept rests on the ability to connect any server to any storage device operating over the Fiber Channel protocol. The principle of node interaction in a SAN with point-to-point or switched topologies is shown in the figures. In an Arbitrated Loop SAN, data is transferred sequentially from node to node; to begin a transmission, the transmitting device arbitrates for the right to use the transmission medium (hence the topology's name, Arbitrated Loop).

The transport basis of the SAN is the Fiber Channel protocol, which uses both copper and fiber-optic device connections.

SAN Components

SAN components are divided into the following groups:

  • Data storage resources;
  • Devices that implement the SAN infrastructure;
  • Host bus adapters (HBAs).

Storage resources

Storage resources include disk arrays, tape drives, and Fiber Channel tape libraries. Many of their capabilities are realized only once they are included in a SAN: high-end disk arrays can replicate data between arrays over Fiber Channel networks, and tape libraries can move data to tape directly from Fiber Channel disk arrays, bypassing the LAN and the servers (serverless backup). The most popular disk arrays on the market are those from EMC, Hitachi, IBM, and Compaq (the StorageWorks family Compaq inherited from Digital), and among tape library manufacturers StorageTek, Quantum/ATL, and IBM should be mentioned.

Devices Implementing SAN Infrastructure

The devices that implement the SAN infrastructure are Fiber Channel switches (FC switches), hubs (Fibre Channel hubs), and routers (Fibre Channel-SCSI routers). Hubs are used to combine devices operating in the Fiber Channel Arbitrated Loop (FC_AL) mode. Hubs allow devices to be connected and disconnected without stopping the system, since the hub automatically closes the loop when a device is disconnected and automatically opens it when a new device is connected. Each change of the loop is accompanied by a complex, multi-stage initialization process, and until it completes, data exchange in the loop is impossible.

All modern SANs are built on switches, which allow a full-fledged network connection. Switches can not only connect Fiber Channel devices but also restrict access between devices, for which so-called zones are created on the switches. Devices placed in different zones cannot communicate with each other. The number of ports in a SAN can be increased by connecting switches to each other. A group of connected switches is called a Fiber Channel Fabric, or simply a fabric. The links between switches are called Inter-Switch Links (ISLs).
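As a rough illustration of what zoning enforces, the sketch below models a zone set as named groups of port WWNs: two devices may communicate only if at least one zone contains both. The zone names and WWN values are invented for the example.

```python
# Minimal model of Fiber Channel zoning: a fabric-wide zone set maps zone
# names to the port WWNs that belong to them. All names below are made up.
zones = {
    "zone_db":  {"50:01:43:80:11:22:33:44", "50:06:01:60:aa:bb:cc:dd"},
    "zone_web": {"50:01:43:80:55:66:77:88", "50:06:01:60:aa:bb:cc:dd"},
}

def can_communicate(wwpn_a, wwpn_b):
    """Two devices may talk only if some zone contains both of them."""
    return any(wwpn_a in members and wwpn_b in members
               for members in zones.values())

# The DB server reaches the array port, but the two servers cannot see
# each other, because no zone contains them both.
print(can_communicate("50:01:43:80:11:22:33:44", "50:06:01:60:aa:bb:cc:dd"))  # True
print(can_communicate("50:01:43:80:11:22:33:44", "50:01:43:80:55:66:77:88"))  # False
```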

Software

This software implements redundant server access paths to disk arrays and dynamic load distribution between those paths. For most disk arrays there is a simple way to determine that ports accessed through different controllers lead to the same disk. Specialized software maintains a table of access paths to devices, disables paths in the event of a failure, dynamically attaches new paths, and balances load between them. As a rule, disk array manufacturers offer such specialized software for their own arrays. VERITAS Software produces VERITAS Volume Manager, designed to organize logical disk volumes from physical disks, provide redundant disk access paths, and balance load between them for most known disk arrays.
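A toy model of what such software does may help: round-robin load balancing across healthy paths, with failover falling out naturally because failed paths leave the rotation. The path names and methods here are assumptions for illustration, not any vendor's API.

```python
class MultipathDevice:
    """Toy multipath model: round-robin over healthy paths; failover is
    implicit because failed paths simply drop out of the rotation."""

    def __init__(self, paths):
        self.state = {p: "active" for p in paths}
        self._turn = 0

    def fail(self, path):                    # e.g. an HBA or switch port died
        self.state[path] = "failed"

    def restore(self, path):                 # path repaired, rejoin rotation
        self.state[path] = "active"

    def next_path(self):
        alive = [p for p, s in self.state.items() if s == "active"]
        if not alive:
            raise IOError("all paths to the disk array are down")
        self._turn = (self._turn + 1) % len(alive)
        return alive[self._turn]

dev = MultipathDevice(["hba0->ctrlA", "hba1->ctrlB"])
print(dev.next_path())       # alternates between the two paths
dev.fail("hba0->ctrlA")
print(dev.next_path())       # all I/O now flows via hba1->ctrlB
```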

Protocols Used

SANs use low-level protocols:

  • Fiber Channel Protocol (FCP), SCSI transport over Fiber Channel. Currently the most commonly used protocol. Available at 1 Gbit/s, 2 Gbit/s, 4 Gbit/s, 8 Gbit/s, and 10 Gbit/s.
  • iSCSI, SCSI transport over TCP/IP.
  • FCoE, FCP/SCSI transport over pure Ethernet.
  • FCIP and iFCP, FCP/SCSI encapsulated and transmitted in IP packets.
  • HyperSCSI, SCSI transport over Ethernet.
  • FICON transport over Fiber Channel (used only by mainframes).
  • ATA over Ethernet, ATA transport over Ethernet.
  • SCSI and/or TCP/IP transport over InfiniBand (IB).

Advantages

  • High reliability of access to data located on external storage systems. Independence of the SAN topology from the storage systems and servers used.
  • Centralized data storage (reliability, security).
  • Convenient centralized management of switching and data.
  • Offloading the LAN by moving intensive I/O traffic to a separate network.
  • High performance and low latency.
  • Scalability and flexibility of the SAN's logical design.
  • Geographical extent of a SAN, unlike classic DAS, is practically unlimited.
  • The ability to quickly distribute resources between servers.
  • The ability to build fault-tolerant cluster solutions at no additional cost on top of the existing SAN.
  • A simple backup scheme: all data is in one place.
  • Availability of additional features and services (snapshots, remote replication).
  • High SAN security.

Sharing storage systems typically simplifies administration and adds a good deal of flexibility, because cables and disk arrays do not need to be physically moved and recabled from one server to another.

Another advantage is the ability to boot servers directly from the storage network. With this configuration, a failed server can be quickly and easily replaced with a new one that boots from the failed server's volume on the SAN.

As is well known, the volume of accumulated information and data has been growing intensively in recent years. An IDC Digital Universe study showed that the world's digital content could grow from 4.4 zettabytes to 44 zettabytes by 2020. According to experts, the volume of digital information doubles every two years. Therefore, not only the processing of information but also its storage is an extremely pressing problem today.

To address this, the development of storage systems (data storage networks and systems) is proceeding very actively. Let's try to figure out what exactly the modern IT industry means by the term "data storage system".

A storage system is an integrated software and hardware solution aimed at organizing reliable, high-quality storage of various information resources and providing uninterrupted access to them.

The creation of such a complex should help in solving a variety of tasks facing modern business in the course of building an integral information system.

The main components of the storage system:

Storage devices (tape library, internal or external disk array);

Monitoring and control system;

Data backup/archiving subsystem;

Storage management software;

Access infrastructure to all storage devices.

Main tasks

Consider the most typical tasks:

Decentralization of information. Some organizations have an extensive branch structure, and each unit of such an organization needs free access to all the information it requires for work. Modern storage systems interact with users located far from the center where data is processed, and so can solve this problem.

Inability to foresee the final required resources. During project planning, it can be extremely difficult to determine exactly how much information you will have to work with during the operation of the system. In addition, the amount of accumulated data is constantly increasing. Most modern storage systems support scalability (the ability to increase its performance after adding resources), so the system power can be increased in proportion to the increase in loads (upgrade).

Security of all stored information. It can be quite difficult to control and restrict access to information resources of an enterprise. Unskilled actions of service personnel and users, deliberate attempts to sabotage - all this can cause significant harm to stored data. Modern storage systems use various fault tolerance schemes that allow them to resist both deliberate sabotage and inept actions of unskilled employees, thereby maintaining the system's performance.

The complexity of managing distributed information flows. Any action that changes distributed data in one of the branches inevitably creates problems: from the difficulty of synchronizing different databases and versions of developer files to unnecessary duplication of information. The management software supplied with a storage system helps simplify and optimize work with stored information.

High costs. According to a study by IDC Perspectives, data storage accounts for about twenty-three percent of all IT spending. These costs include the cost of the complex's software and hardware components, payments to service personnel, and so on. Using storage systems saves on system administration and reduces personnel costs.


Main types of storage systems

All data storage systems are divided into two types: tape and disk storage systems. Each of these two types is divided, in turn, into several subtypes.

Disk storage

Such storage systems are used for operational work with data and for creating intermediate backup copies.

Disk storage systems are divided into the following subtypes:

Devices for backups (various disk libraries);

Devices for working data (high-performance equipment);

Devices for long-term storage of archives.


Tape storage

Used to create archives and backups.

Tape storage systems are divided into the following subtypes:

Tape libraries (two or more drives, many tape slots);

Autoloaders (one drive, several tape slots);

Standalone drives.

Main connection interfaces

Above we examined the main types of systems; now let's take a closer look at the structure of the storage systems themselves. Modern storage systems are classified by the type of host interface they use. Below we consider the two most common external connection interfaces: SCSI and Fibre Channel. SCSI resembles the widely used IDE; it is a parallel interface that allows up to sixteen devices on one bus (versus two devices per channel for IDE). The maximum speed of the SCSI protocol today is 320 megabytes per second (a version providing 640 megabytes per second is in development). The disadvantages of SCSI are its inconvenient, noise-sensitive, overly thick cables with a maximum length of no more than twenty-five meters. The SCSI protocol itself also imposes restrictions: as a rule, one initiator on the bus plus slave devices (streamers, disks, etc.).

The FibreChannel interface is used less often than SCSI because the hardware for it is more expensive. In addition, FibreChannel is deployed for large SAN storage networks, so it is used mainly in large companies. Distances can be practically anything: from the standard three hundred meters on typical equipment to two thousand kilometers for powerful switches ("directors"). The main advantage of FibreChannel is the ability to combine many storage devices and hosts (servers) into a common SAN. Lesser advantages include greater distances than SCSI, link aggregation, redundant access paths, hot-plugging of equipment, and higher noise immunity. Two-core single-mode and multi-mode optical cables (with SC or LC connectors) are used, as well as SFP optical transceivers built on laser or LED emitters; these components determine the maximum distance between devices and the transmission speed.

Storage topology options

Traditionally, storage is attached directly to the server, in the DAS model. Besides DAS, there are also NAS (storage devices attached to a network) and SAN (components of storage area networks). SAN and NAS systems were created as alternatives to the DAS architecture. Each of these solutions was developed in response to ever-growing requirements for storage systems and was based on the technologies available at the time.

The architectures of the first networked storage systems were developed in the 1990s to address the most tangible shortcomings of DAS. Network storage solutions were designed to reduce the cost and complexity of data management, reduce LAN traffic, and improve data availability and overall performance. SAN and NAS address different aspects of the same general problem, so the two network architectures came to exist in parallel, each with its own functionality and benefits.

DAS


DAS (Direct Attached Storage) is an architectural solution used when a device for storing digital data is connected directly to a server or workstation, for example over a SAS interface.


The main advantages of DAS systems are low cost compared with other storage solutions, ease of deployment and administration, and high-speed data exchange between the server and the storage system.

These advantages made DAS systems extremely popular in the segment of small corporate networks, hosting providers, and small offices. But DAS systems have drawbacks too, for example sub-optimal resource utilization: each DAS system requires a dedicated server, and each allows no more than two servers to be connected to the disk shelf, and only in certain configurations.

Advantages:

Affordable cost. The storage system is essentially a disk enclosure with hard drives, installed outside the server.

Ensuring high-speed exchange between the server and the disk array.


Disadvantages:

Insufficient reliability: in the event of a failure or network problems, the servers become unavailable to a number of users.

High latency due to the fact that all requests are processed by one server.

Lack of manageability: having the entire capacity available to a single server reduces the flexibility of data distribution.

Low resource utilization: the amount of data required is difficult to predict, so some DAS devices in an organization may have excess capacity while others lack it, since reallocating capacity is usually too laborious or not possible at all.

NAS


NAS (Network Attached Storage) is an integrated stand-alone disk system that includes a NAS server with its own specialized operating system and a set of convenient functions ensuring quick system startup and access to files. The system connects to an ordinary computer network, letting users of that network solve the problem of insufficient disk space.

NAS is storage that connects to the network like a regular network device, providing file access to digital data. Any NAS device is a combination of a storage system and a server to which this system is connected. The simplest version of a NAS device is a network server that provides file shares.

NAS devices consist of a head unit that performs data processing and connects a chain of disks into a single network. NAS brings storage systems to Ethernet networks, with file sharing organized over TCP/IP. Such devices allow file sharing even among clients running different operating systems. Unlike the DAS architecture, NAS does not require servers to be taken offline to increase overall capacity: disks can be added to a NAS structure simply by connecting the device to the network.

NAS technology is developing today as an alternative to general-purpose servers that carry many different functions (e-mail, fax server, applications, printing, etc.). NAS devices, by contrast, perform only one function, that of a file server, and try to do it as quickly, simply, and efficiently as possible.

Connecting a NAS to a LAN provides access to digital information for an unlimited number of heterogeneous clients (clients with different operating systems) and other servers. Today almost all NAS devices are used in Ethernet networks based on TCP/IP. Access to NAS devices uses special file access protocols; the most common are DAFS, NFS, and CIFS. Specialized operating systems are installed inside such servers.

A NAS device can be a simple "box" with one Ethernet port and a couple of hard drives, or a huge system with several specialized servers, an enormous number of drives, and external Ethernet ports. Sometimes NAS devices are part of a SAN; in that case they have no drives of their own and only provide file access to data located on block devices. The NAS then acts as a powerful specialized server and the SAN as the storage device, and the SAN and NAS components together form a single storage topology.

Advantages

Low cost, availability of resources for individual servers, as well as for any computer in the organization.

Versatility (one server is able to serve Unix, Novell, MS, Mac clients).

Ease of deployment as well as administration.

Ease of sharing resources.


Disadvantages

Accessing information through network file system protocols is often slower than accessing a local disk.

Most affordable NAS servers cannot provide the flexible, high-speed block-level (rather than file-level) access that modern SAN systems offer.

SAN


SAN (Storage Area Network) is an architectural solution for connecting external storage devices (tape libraries, disk arrays, optical drives, etc.) to servers in such a way that the operating system recognizes them as local devices. Using a SAN reduces the total cost of maintaining a storage system and allows modern organizations to store their information reliably.

The simplest SAN option is storage systems, servers and switches connected by optical communication channels. In addition to disk storage systems, disk libraries, streamers (tape libraries), devices used to store information on optical disks, etc. can be connected to the SAN.

Advantages

Reliability of access to those data that are on external systems.

Independence of SAN topology from used servers and data storage systems.

Security and reliability of centralized data storage.

Convenient centralized data management and switching.

The ability to move I/O traffic to a separate network, offloading the LAN.

Low latency and high performance.

Flexibility and scalability of the SAN logical structure.

Practically unlimited geographical size of the SAN.

Ability to quickly distribute resources between servers.

The simplicity of the backup scheme, ensured by the fact that all data is located in one place.

The ability to create failover cluster solutions based on an existing SAN at no additional cost.

Availability of additional services and features, such as remote replication, snapshots, etc.

High SAN security.


The only drawback of such solutions is their high cost. In general, the domestic storage market lags behind the markets of developed Western countries, where storage systems are in widespread use. High cost and the lack of high-speed communication channels are the main factors holding back the development of the Russian storage market.

RAID

Speaking of data storage systems, one should certainly consider one of the main technologies underlying their operation and widely used in the modern IT industry: RAID arrays.

A RAID array consists of several disks controlled by a controller and interconnected by high-speed data channels. An external system perceives such disks (memory devices) as a single whole. The array type directly affects performance and fault tolerance. RAID arrays are used to improve the reliability of data storage and to increase read/write speed.

There are several RAID levels used when building storage systems. The most commonly used levels are:

1. RAID 0. A striped disk array with increased performance but no fault tolerance.
The information is divided into separate blocks of data, which are written simultaneously to two or more disks.
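A minimal sketch of the striping arithmetic: logical block i lands on disk i mod n, in stripe row i div n. The four-disk layout is an assumed example.

```python
def stripe_location(logical_block, n_disks):
    """RAID 0: block i lands on disk i % n at stripe row i // n."""
    return logical_block % n_disks, logical_block // n_disks

for blk in range(8):
    disk, row = stripe_location(blk, n_disks=4)
    print(f"logical block {blk} -> disk {disk}, row {row}")
```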

Pros:

The capacities of the disks are summed.

A significant increase in performance (the number of disks directly affects the performance gain).


Minuses:

The reliability of RAID 0 is lower than that of even the least reliable single disk, because if any disk fails, the entire array becomes inoperable.


2. RAID 1. A mirrored disk array, consisting of a pair of disks that completely copy each other.

Pros:

Acceptable write speed when requests are parallelized, and a gain in read speed.

High reliability: an array of this type functions as long as at least one disk in it is operational. The probability of both disks failing simultaneously, equal to the product of their individual failure probabilities, is much lower than the failure probability of a single disk. In practice, if one disk fails, immediate action must be taken to restore redundancy; for this, hot-spare disks are recommended with RAID of any level (except zero).
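A quick back-of-the-envelope check of that claim, assuming independent failures and an illustrative 3% annual failure probability per disk:

```python
p_disk = 0.03                  # assumed annual failure probability of one disk
p_single = p_disk              # a lone disk loses the data with probability p
p_mirror = p_disk ** 2         # RAID 1 loses data only if both copies fail
print(p_single, p_mirror)      # 0.03 vs 0.0009
print(p_single / p_mirror)     # the mirror is ~33x less likely to lose data
# Caveat: this ignores the rebuild window; as noted above, a failed disk
# must be replaced promptly or the advantage evaporates.
```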


Minuses:

The disadvantage of RAID 1 is that the user gets the usable capacity of one disk at the price of two.



3. RAID 10. A RAID 0 array built from RAID 1 arrays.

4. RAID 2. Used for arrays using Hamming code.

Arrays of this type are based on the Hamming code. The disks are divided into two groups: those for data and those for error-correction codes. Data on the data disks is distributed as in RAID 0, divided into small blocks according to the number of disks. The remaining disks store error-correction codes, from which information can be restored if one of the hard disks fails. The Hamming method, also used in ECC memory, corrects single-bit errors on the fly and detects double-bit errors.
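To show the principle, here is a hedged sketch of the classic Hamming(7,4) code that underlies this scheme: four data bits gain three parity bits, and the recomputed parity pattern (the syndrome) points directly at a single corrupted position. This illustrates the idea only; it is not the exact bit layout of any real RAID 2 controller.

```python
def hamming74_encode(d):
    """d: four data bits. Returns the 7-bit codeword, positions 1..7,
    with parity bits at positions 1, 2, and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # covers positions 2,3,6,7
    p4 = d2 ^ d3 ^ d4          # covers positions 4,5,6,7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(c):
    """Recomputes the three checks; their pattern (the syndrome) is
    exactly the 1-based position of a single flipped bit."""
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s4 * 4 + s2 * 2 + s1
    if pos:                            # non-zero syndrome: flip that bit
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]    # extract the four data bits

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[4] ^= 1                           # simulate a failure of one "disk"
assert hamming74_correct(code) == word  # the error is found and fixed
```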

5. RAID 3, RAID 4. Striped disk arrays with a dedicated parity disk. In RAID 3, data in an array of n disks is split into sub-sector units (bytes or blocks) and distributed across n-1 disks, with parity stored on the remaining disk. In RAID 2, n-1 disks were used for this purpose, but most of the information on the check disks served on-the-fly error correction, whereas most users are content with simply recovering information after a disk failure, for which the information fitting on one dedicated disk suffices.

A RAID 4 array is similar to RAID 3, except that data is divided into blocks rather than bytes, which partly solved the problem of low transfer rates for small data volumes. Writes, however, are slow, because each write must generate parity for the block and write it to the single parity disk, which becomes a bottleneck.
RAID 3 differs from RAID 2 in its inability to correct errors on the fly and in its lower redundancy.

RAID 5. The main disadvantage of RAID levels two through four is the impossibility of parallel write operations, because a single dedicated parity disk is used to store parity information. RAID 5 does not have this disadvantage: checksums and data blocks are written to all disks alike, and there is no asymmetry in the disk configuration. A checksum here means the result of an XOR operation. XOR lets any operand be replaced by the result: applying XOR again yields the missing operand. To store the XOR result, only one disk's worth of space is needed (equal in size to any single disk in the array).
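The XOR property described above is easy to demonstrate. A minimal sketch with one parity block over three data blocks (in a real RAID 5 the parity rotates across all disks rather than living on one):

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks: the RAID 5 checksum."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three data blocks on three disks; parity goes on a fourth.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Disk 1 dies: XOR of the survivors and the parity rebuilds its contents,
# because each XOR operand is recoverable from the result and the rest.
lost = 1
rebuilt = xor_blocks([blk for i, blk in enumerate(data) if i != lost] + [parity])
assert rebuilt == data[lost]
print(rebuilt)    # b'BBBB'
```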

Pros:

The popularity of RAID 5 is primarily due to its cost-effectiveness. Writes to a RAID 5 volume cost extra resources, since additional computations and writes are required, and performance suffers. On reads, however, there is an advantage over a single hard disk: data streams coming from several disks can be processed in parallel.


Minuses:

RAID 5 has much slower performance, especially on random writes (Random Write), which drop to 10-25 percent of RAID 10 or RAID 0 performance. This is because more disk operations are required: each server write is replaced on the RAID controller by three operations (one read and two writes). The drawbacks of RAID 5 show when a disk fails: the entire volume goes into a critical mode, all reads and writes are accompanied by additional manipulations, and performance drops sharply. The reliability level falls to that of RAID 0 with the corresponding number of disks, i.e. n times lower than the reliability of a single disk. If another drive fails, or an unrecoverable read error occurs, before the array is rebuilt, the array is destroyed and the data on it cannot be restored by conventional means. Bear in mind also that the rebuild process (RAID Reconstruction) imposes an intense continuous read load on all drives for many hours, during which one of the remaining drives may fail. Previously undetected read failures in cold data (data not accessed during normal operation of the array, inactive and archived) may also surface, raising the risk of failure during recovery.



6. RAID 50. A RAID 0 array built from RAID 5 arrays.

7. RAID 6. A striped disk array using two checksums computed in two independent ways.

RAID 6 is in many ways similar to RAID 5 but offers a higher degree of reliability: the capacity of two disks is allocated to checksums, and the two sums are computed by different algorithms. A more powerful RAID controller is required. The array remains operational after two drives fail simultaneously, protecting against multiple failures. Building the array requires at least four drives. Using RAID 6 typically costs about a 10-15 percent drop in disk group performance, owing to the larger amount of processing the controller performs (computing the second checksum, and reading and rewriting more disk blocks when writing each block).

8. RAID 60. A RAID 0 array built from RAID 6 arrays.

9. Hybrid RAID. Another RAID level that has become quite popular lately: an ordinary RAID level combined with additional software and SSDs used as a read cache. Since SSDs have far better speed characteristics than HDDs, system performance rises. Several implementations exist today, for example Crucial Adrenaline and several budget Adaptec controllers. At present, heavy use of Hybrid RAID is not recommended because of the limited write endurance of SSD drives.


In Hybrid RAID, reads are served by the faster SSD, while writes go to both the SSDs and the HDDs (for redundancy).
Hybrid RAID is well suited to applications working with low-level data, such as virtual machines, file servers, or Internet gateways.
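A toy sketch of that read/write policy, with dictionaries standing in for the HDD array and the SSD cache; the class and method names are assumptions for illustration:

```python
class HybridVolume:
    """Toy Hybrid RAID model: every write goes to the (redundant) HDDs
    for safety, reads are served from the SSD cache when possible."""

    def __init__(self):
        self.hdd = {}        # stands in for the redundant HDD array
        self.ssd = {}        # stands in for the SSD read cache

    def write(self, block_no, data):
        self.hdd[block_no] = data      # durable copy
        self.ssd[block_no] = data      # keep the cache coherent

    def read(self, block_no):
        if block_no in self.ssd:       # fast path: SSD cache hit
            return self.ssd[block_no]
        data = self.hdd[block_no]      # slow path: HDD, then populate cache
        self.ssd[block_no] = data
        return data

vol = HybridVolume()
vol.write(7, b"payload")
print(vol.read(7))    # served from the SSD cache
```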

Features of the modern storage market

In the summer of 2013, the analytics firm IDC published its latest storage market forecast, extending to 2017. Analysts calculate that over the next four years, enterprises worldwide will purchase storage systems with a total capacity of one hundred and thirty-eight exabytes. Total shipped storage capacity will grow by about 30% annually.

However, compared to previous years, when there was a rapid growth in data storage consumption, the pace of this growth will slow down somewhat, as today most companies use cloud solutions, giving preference to technologies that optimize data storage. Storage space savings are achieved using tools such as virtualization, data compression, data deduplication, etc. All of the above tools provide space savings, allowing companies to avoid spontaneous purchases and resort to purchasing new storage systems only when they are really needed.

Of the 138 exabytes expected to be sold in 2017, 102 exabytes will be external storage and 36 exabytes internal. In 2012, twenty exabytes of storage was implemented for external systems and eight for internal ones. Financial costs for industrial storage systems will increase by approximately 4.1 percent annually and by 2017 will amount to about forty-two and a half billion dollars.

We have already noted that the global storage market, which has recently experienced a real boom, gradually began to decline. In 2005, the growth in storage consumption at the industrial level was sixty-five percent, and in 2006 and 2007 - fifty-nine percent each. In subsequent years, the growth in storage consumption has further declined due to the negative impact of the global economic crisis.

Analysts predict that the growth in the use of cloud storage will lead to a decrease in the consumption of storage solutions at the enterprise level. Cloud providers also actively purchase storage systems for their needs, for example, Facebook and Google build their own servers from ready-made components to order, but these servers are not included in the IDC report.

IDC also expects that emerging markets will soon overtake developed markets in terms of storage consumption, as they are characterized by higher economic growth rates. For example, the region of Eastern and Central Europe, Africa and the Middle East in 2014 in terms of spending on storage systems will surpass Japan. By 2015, the Asia-Pacific region, excluding Japan, will surpass Western Europe in storage consumption.

Prompt sale of storage systems

The sale of data storage systems carried out by our company, Navigator, gives everyone the opportunity to acquire a reliable, durable basis for storing multimedia data. A wide selection of RAID arrays, network storage, and other systems makes it possible to select individually, for each customer, the complex that suits them best.

Wide technical capabilities, literacy and experience of the company's personnel guarantee a quick and comprehensive implementation of the task. At the same time, we are not limited solely to the sale of storage systems, since we also perform its configuration, launch and subsequent service and maintenance.

This article will focus on entry-level and mid-range storage systems, as well as those trends that stand out clearly in this industry today. For convenience, we will call data storage systems drives.

First, we will briefly dwell on the terminology and technological foundations of stand-alone drives, and then we will move on to novelties and a discussion of modern achievements in various technology and marketing groups. We will also be sure to talk about why systems of one kind or another are needed and how effective their use is in different situations.

Autonomous disk subsystems

To better understand the features of autonomous drives, let's dwell briefly on one of the simpler technologies for building storage systems: bus-oriented technology. It uses an enclosure for disk drives and a PCI RAID controller.

Figure 1. Bus-based technology for building data storage systems

Thus, between the disks and the host PCI bus (host here meaning a stand-alone computer, such as a server or workstation) there is only one controller, which largely determines the speed of the system. Drives built on this principle are the most performant. But because of the architecture, their practical use, with rare exceptions, is limited to single-host configurations.

The disadvantages of a bus-oriented storage architecture include:

  • efficient use only in single host configurations;
  • dependence on the operating system and platform;
  • limited scalability;
  • limited possibilities for organizing fault-tolerant systems.

Naturally, none of this matters if the data is needed by one server or workstation; on the contrary, in such a configuration you get maximum performance for minimum money. But if storage is needed for a large data center, or even for two servers that need the same data, a bus-oriented architecture is out of the question. The architecture of autonomous disk subsystems avoids these disadvantages. Its basic principle is quite simple: the controller that manages the system moves from the host computer into the drive enclosure, which then operates independently of the host systems. Such a system may have many external input/output channels, allowing several or even many computers to be connected to it.


Figure 2. Autonomous storage system

Any intelligent data storage system consists of hardware and program code. An autonomous system always has memory storing the program of the system's own operating algorithms, and processor elements that execute this code. Such a system functions regardless of which host systems it is connected to. Thanks to their intelligence, autonomous drives often implement many data-protection and management functions themselves. One of the most important and almost universal features is RAID (Redundant Array of Independent Disks). Another, belonging to mid-range and high-end systems, is virtualization, which provides features such as instant copy (snapshots) or remote backup, as well as other rather sophisticated algorithms.
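The "instant copy" mentioned here is commonly built on copy-on-write: taking the snapshot copies nothing, and an old block is preserved only when it is first overwritten. A hedged, dictionary-based sketch of the idea (not any particular vendor's implementation):

```python
class Volume:
    def __init__(self):
        self.blocks = {}           # block number -> current data
        self.snapshot = None

    def take_snapshot(self):
        """Instant: nothing is copied; the snapshot starts empty and
        falls through to the live volume for unchanged blocks."""
        self.snapshot = {}

    def write(self, n, data):
        if self.snapshot is not None and n not in self.snapshot:
            # first overwrite since the snapshot: preserve the old version
            self.snapshot[n] = self.blocks.get(n)
        self.blocks[n] = data

    def read_snapshot(self, n):
        if self.snapshot is not None and n in self.snapshot:
            return self.snapshot[n]
        return self.blocks.get(n)

v = Volume()
v.write(0, b"old")
v.take_snapshot()
v.write(0, b"new")
print(v.blocks[0], v.read_snapshot(0))   # b'new' b'old'
```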

Briefly about SAS, NAS, SAN

As part of the consideration of autonomous storage systems, it is imperative to dwell on how host systems access drives. This largely determines the scope of their use and internal architecture.

There are three main options for organizing access to drives:

  • SAS (Server Attached Storage) - drive connected to the server [other name DAS (Direct Attached Storage) - directly attached drive];
  • NAS (Network Attached Storage) - a drive connected to a network;
  • SAN (Storage Area Network) - storage area network.

We have already written about SAS/DAS, NAS, and SAN technologies in an article dedicated to SANs; if this information interests you, we recommend the iXBT pages. Still, let us refresh the material a little, with an emphasis on practical use.

SAS/DAS is a fairly simple, traditional connection method involving a direct (hence DAS) connection of the storage system to one or more host systems through a high-speed channel interface. Often the drive is attached using the same interface that accesses the host's internal disks, which generally gives high performance and easy connection.

A SAS system can be recommended where large volumes of data must be processed at high speed on one or a few host systems: for example, a file server, a graphics workstation, or a failover cluster of two nodes.



Figure 3. Cluster system with shared storage

NAS is a drive connected to a network, providing file-level (note: file, not block) access to data for host systems on a LAN/WAN. Clients working with a NAS usually use the NFS (Network File System) or CIFS (Common Internet File System) protocols to access data. The NAS interprets file protocol commands and issues requests to its disk drives over its internal channel protocol. In fact, the NAS architecture is an evolution of file servers. The main advantage of this solution is the speed of deployment and the quality of file access, thanks to specialization and narrow focus.

Based on the foregoing, a NAS can be recommended where network access to files is needed and the simplicity of the solution (usually a kind of quality guarantee) and the ease of its maintenance and installation are important factors. An excellent example is using a NAS as a file server in a small company office, where ease of installation and administration matter. At the same time, if files must be accessed from many host systems, a powerful NAS drive, being a polished specialized solution, can sustain intensive traffic with a huge pool of servers and workstations at a fairly low cost of communication infrastructure (for example, Gigabit Ethernet switches and copper twisted pair).

SAN is a data storage network. SANs typically use block-level data access, although devices that provide file services, such as NAS, can be attached to them. In modern implementations the Fiber Channel protocol is used most often, but in general it is not mandatory, which is why Fiber Channel SANs (storage networks based on Fiber Channel) are singled out as a separate class.

At the heart of a SAN is a network, separate from the LAN/WAN, that organizes access to data for the servers and workstations directly engaged in processing it. This structure makes building high-availability, high-demand systems relatively simple. Although SANs remain expensive today, the total cost of ownership (TCO) of medium and large systems built with SAN technology is quite low. A description of how SANs reduce the TCO of enterprise storage can be found on the techTarget site: http://searchstorage.techtarget.com .

Today, the cost of Fiber Channel disk drives, the most common interface for building SANs, approaches the cost of systems with traditional low-cost channel interfaces (such as parallel SCSI). The main cost component of a SAN remains the communications infrastructure, together with its deployment and maintenance. Hence the active work within SNIA and many commercial organizations on IP Storage technologies, which allow far cheaper equipment and IP network infrastructure to be used, along with the enormous base of IP networking expertise.

Many examples of effective SAN use can be cited: almost anywhere multiple servers need shared storage, a SAN fits. For example, collaborative work on video data or prepress processing, where every participant in digital content processing can work on terabytes of data almost simultaneously. Or backing up the large data volumes used by many servers: with a SAN, a LAN/WAN-independent backup scheme, and snapshot technologies, almost any amount of information can be backed up without compromising the functionality and performance of the whole information complex.

Fiber Channel in SANs

It is beyond doubt that FC (Fiber Channel) dominates storage networks today, and it was the development of this interface that led to the SAN concept itself.

FC was designed by specialists with significant experience in both channel and network interfaces, and they managed to combine the important positive features of both directions. One of the most important advantages of Fibre Channel, alongside its speed parameters (which, incidentally, are not always paramount for SAN users and can be achieved with other technologies), is its ability to work over long distances and its topological flexibility, inherited from network technologies. Thus, SAN topology rests on the same principles as traditional local networks, based on hubs, switches, and routers, which greatly simplifies building multi-node configurations, including ones without a single point of failure.

It is also worth noting that within Fibre Channel, both fiber-optic and copper media are used for data transmission. When accessing geographically remote nodes up to 10 kilometers away, standard equipment and single-mode optical fiber are used; if nodes are 10 or even 100 kilometers apart, special amplifiers are used. In building such SANs, parameters quite unusual for storage systems are taken into account, such as the speed of signal propagation in the fiber.

Storage Trends

The world of storage systems is extremely diverse; the capabilities of storage systems and the costs of solutions vary widely. There are solutions that serve hundreds of thousands of requests per second against tens or even hundreds of terabytes of data, and solutions for a single computer with inexpensive IDE drives.

IDE RAID

Recently, the maximum capacity of IDE drives has grown enormously: it is roughly twice that of SCSI drives, and in price per unit of capacity IDE drives lead by a factor of more than six. This has not, unfortunately, helped the reliability of IDE drives, but their role in autonomous storage systems keeps growing nonetheless. The main driver is that demand for large data volumes grows faster than the capacity of individual disks.

A few years ago, few manufacturers dared release autonomous subsystems built around IDE drives; today almost every manufacturer aimed at the entry-level market produces them. IDE drives are most widespread in autonomous entry-level NAS systems. After all, if a NAS is used as a file server with a Fast Ethernet or even Gigabit Ethernet interface, the speed of such drives is more than sufficient in most cases, and their lower reliability is compensated by RAID technology.

Where block access to data at minimal cost per unit of stored information is required, systems with IDE disks inside and an external SCSI interface are actively used today. For example, on the JetStor IDE system made by the American company AC&NC, for a fault-tolerant archive with 10 terabytes of stored data and fast block access, the cost of storing one megabyte comes to less than 0.3 cents.

Another interesting and rather original technology I encountered quite recently is the Raidsonic SR-2000 system with an external parallel IDE interface.


Figure 4. Standalone entry-level IDE RAID

This is an autonomous disk system designed for two IDE disks, intended to be mounted inside the host system's case, and completely independent of the host machine's operating system. The system can run RAID 1 (mirroring) or simply copy data from one disk to the other, with hot-swappable disks, without any damage or inconvenience to the computer user, which cannot be said of bus-oriented subsystems built on PCI IDE RAID controllers.

It should be noted that leading IDE drive manufacturers have announced mid-range drives with the Serial ATA interface employing high-end technologies. This should improve their reliability and increase the share of ATA solutions in data storage systems.

What Serial ATA will bring us

The first and most pleasant thing about Serial ATA is the cable. Because the ATA interface became serial, the cable became round and the connector narrow. If you have ever routed parallel IDE cables in an eight-channel IDE system, I am sure you will love this feature. Round IDE cables have of course existed for a long time, but their connectors remained wide and flat, and the maximum permitted length of a parallel ATA cable is not encouraging. When building systems with many disks, a standard cable hardly helps at all: the cables have to be made by hand, and routing them becomes nearly the most labor-intensive assembly task.

Besides the cable system, Serial ATA brings other innovations that cannot be retrofitted onto the parallel interface with a utility knife and other handy tools. Drives with the new interface should soon support the Native Command Queuing instruction set: the Serial ATA controller analyzes I/O requests and reorders their execution so as to minimize seek time. The idea is quite similar to SCSI command queuing, although Serial ATA will support a queue of up to 32 commands rather than SCSI's traditional 256. Native hot-swap support has also appeared; such a possibility existed before, but its implementation lay outside the standard and could not be widely adopted. As for Serial ATA's new speed capabilities, they bring no great joy yet; what matters is the good roadmap for the future, which would be very hard to realize within parallel ATA.
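The reordering idea can be sketched as a tiny elevator algorithm: given queued block addresses and the current head position, service everything in one sweep instead of in arrival order. The addresses below are invented, and real NCQ firmware is of course far more sophisticated.

```python
def naive_order(requests):
    return list(requests)            # FIFO: the head jumps back and forth

def ncq_order(requests, head=0):
    """One-directional sweep (elevator): serve everything ahead of the
    head in ascending order, then the rest - far less head movement."""
    ahead = sorted(r for r in requests if r >= head)
    behind = sorted((r for r in requests if r < head), reverse=True)
    return ahead + behind

def seek_distance(order, head=0):
    total = 0
    for r in order:
        total += abs(r - head)
        head = r
    return total

queue = [7200, 120, 9800, 450, 5000]      # pending LBAs; SATA NCQ queues up to 32
print(seek_distance(naive_order(queue)))  # 37860: a long zig-zag
print(seek_distance(ncq_order(queue)))    # 9800: a single sweep
```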

Given the above, there is no doubt that the share of ATA solutions in entry-level storage systems should increase precisely due to new Serial ATA drives and storage systems focused on the use of such devices.

Where does parallel SCSI go

Hardly anyone who works with storage systems, even entry-level ones, would say they like systems with IDE drives. The main advantages of ATA drives are their low price compared with SCSI devices and, probably, their lower noise. And this is for a simple reason: since the SCSI interface is better suited to storage systems, and is still much cheaper than the even more capable Fiber Channel interface, disks with a SCSI interface are manufactured to higher quality and are more reliable and faster than cheap IDE disks.

Many vendors today use Ultra320 SCSI, the newest interface in the family. Many roadmaps once included plans for devices with Ultra640 and even Ultra1280 SCSI interfaces, but it became clear that something in the interface had to change radically. Parallel SCSI, already at the Ultra320 stage, does not suit many users, mainly because of the inconvenience of the classic cables.

Fortunately, the new Serial Attached SCSI (SAS) interface recently appeared. The new standard has interesting features: it combines some aspects of Serial ATA and Fiber Channel, and despite the strangeness of such an interweaving, there is common sense in it. The standard grew from the physical and electrical specifications of Serial ATA, with improvements such as higher signal levels to allow longer cables and greater maximum device addressability. Most interestingly, the technologists promise compatibility between Serial ATA and SAS devices, though only in subsequent versions of the standards.

The most important features of SAS include:

  • point-to-point interface;
  • two-channel interface;
  • support for 4096 devices in the domain;
  • standard SCSI command set;
  • cable up to 10 meters long;
  • 4-core cable;
  • full duplex.

Since the new interface offers the same miniature connector as Serial ATA, developers gain the opportunity to build more compact, high-performance devices. The SAS standard also provides for expanders: each expander supports addressing of 64 devices, with cascading of up to 4096 devices within a domain. This is, of course, far less than Fiber Channel's capability, but for entry-level and mid-range storage systems with drives attached directly to the server it is quite enough.

Despite all its charms, the Serial Attached SCSI interface is unlikely to displace the conventional parallel interface quickly. In the enterprise world, development tends to be more thorough, and therefore slower, than on the desktop; and old technologies fade slowly, since their payback periods are rather long. Still, SAS devices should reach the market in 2004: at first mainly disks and PCI controllers, with data storage systems catching up about a year later.

For a better summary of this information, we suggest a comparison of current and upcoming storage interfaces in the form of a table; the table's footnotes are given below.

1 - The standard regulates distances up to 10 km for single-mode fiber; there are implementations of devices for data transmission over distances greater than 10⁵ m.

2 - Hubs, and some FC switches, operate on an internal virtual ring topology; many switch implementations also provide point-to-point connections between any devices attached to them.

3 - Device implementations exist with SCSI, FICON, ESCON, TCP/IP, HIPPI, and VI protocols.

4 - Manufacturers promise mutual compatibility of the devices in the near future: SAS controllers will support SATA drives and, it is claimed, SATA controllers will support SAS drives (in practice, the compatibility that materialized is one-way: SAS controllers accept SATA drives, but not the reverse).

The NAS boom

Recently, NAS storage has enjoyed mass enthusiasm abroad. As the data-centric approach to building information systems has grown in relevance, specializing the classic file server has become more attractive, giving rise to a new marketing category: NAS. There was already enough experience in building such systems for a quick start in network-attached storage technology, and the cost of a hardware implementation was extremely low. Today NAS devices are produced by virtually all storage vendors, from inexpensive entry-level systems to mid-range ones, and even systems that store tens of terabytes of information and can process a huge number of requests. Each class of NAS system has its own interesting, original solutions.

A PC-based NAS in 30 minutes

We would like to briefly describe one original entry-level solution. Its practical value is debatable, but its originality is not.

In essence, an entry-level NAS device (and not only entry-level) is an ordinary personal computer with a certain number of disks and software that gives other network participants file-level access to the data. How well those components are chosen and connected determines whether the workgroup using the device gets reliable, high-quality access to its data. It is with these factors in mind, plus deployment time and some industrial-design work, that an entry-level NAS device is built.

What distinguishes a good entry-level NAS solution from a self-assembled personal computer configured under the OS of your choice (design aside, again) is:

  • how quickly it can be put together;
  • how easily unskilled personnel can service it;
  • how well it works and how well it is supported.

In other words, with professionally selected components and a preconfigured software bundle, a good result can be achieved. The point may seem banal; the same can be said of any task solved with ready-made components: "hardware" plus "software".

What does Company X propose? A fairly short list of compatible components is drawn up: motherboards with all the integrated facilities an entry-level NAS server needs, plus hard drives. You buy a flash disk, preloaded with the software, that plugs into the motherboard's IDE connector, and you get a ready-made NAS device. On boot, the operating system and utilities recorded on this disk configure the necessary modules appropriately. The result is a device that can be managed both locally and remotely through an HTML interface and that provides access to the disk drives attached to it.
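To make the idea concrete, here is a deliberately minimal sketch of file-level access from a stock PC, using only the Python standard library to publish one directory over HTTP. This is not how the flash-disk product described above works (a real NAS would speak CIFS or NFS), and the directory and port are placeholders of our choosing:

    # toy_nas.py - publish one directory to the network, read-only,
    # using only the Python standard library.
    import functools
    from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler

    SHARE_DIR = "/srv/share"   # hypothetical directory to publish
    PORT = 8080                # hypothetical port

    # SimpleHTTPRequestHandler serves files from `directory`.
    Handler = functools.partial(SimpleHTTPRequestHandler, directory=SHARE_DIR)

    if __name__ == "__main__":
        with ThreadingHTTPServer(("0.0.0.0", PORT), Handler) as srv:
            print(f"Serving {SHARE_DIR} on port {PORT}")
            srv.serve_forever()

Clients on the network can then browse the shared directory at http://<server>:8080/; the point is only that a general-purpose PC plus a small piece of software is already a rudimentary file server.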

File protocols in modern NAS

CIFS (Common Internet File System) is a standard protocol that provides access to files and services on remote computers (including over the Internet). The protocol uses a client-server interaction model: the client requests access to files or sends a message to a program on the server; the server fulfills the request and returns the result. CIFS is an open standard that grew out of Microsoft's SMB (Server Message Block) protocol but, unlike SMB, it allows for long timeouts, since it is also intended for use in distributed networks. SMB has traditionally been used in Windows local networks for file access and printing. CIFS uses TCP/IP to transport data. It provides functionality similar to FTP (File Transfer Protocol) but gives clients finer, more direct control over files. It also lets clients share file access through locking, and it reconnects to the server automatically after a network failure.
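As a sketch of what CIFS access looks like from the client side, here is how a file on an SMB/CIFS share might be read with the third-party Python package smbprotocol; the server name, share, and credentials are placeholders, and the package itself is our choice of illustration, not something the protocol mandates:

    # Reading a file from a CIFS/SMB share with the third-party
    # `smbprotocol` package (pip install smbprotocol).
    import smbclient

    # Authenticate one session to the server; later calls reuse it.
    smbclient.register_session("fileserver", username="user", password="secret")

    # Open a file on the share much like a local file.
    with smbclient.open_file(r"\\fileserver\share\report.txt", mode="r") as f:
        print(f.read())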

NFS (Network File System) is an IETF standard comprising a distributed file system and a network protocol. NFS was developed by Sun Microsystems. It was originally used only on UNIX systems; later, client and server implementations became common on other platforms as well.

NFS, like CIFS, uses the client-server interaction model. It provides access to files on a remote computer (server) for reading and writing as if they were on the user's own machine. Early versions of NFS used UDP to transport data; modern versions use TCP/IP. To make NFS usable over the Internet, Sun developed the WebNFS protocol, which extends NFS functionality to operate correctly on the World Wide Web.

DAFS (Direct Access File System) is a standard file-access protocol based on NFSv4. It lets applications transfer data directly to transport resources, bypassing the operating system and its buffers, while preserving file-system semantics. DAFS exploits modern memory-to-memory data-transfer technologies. It delivers high file I/O speeds with minimal CPU and system load, thanks to a sharp reduction in the number of operations and interrupts normally required to process network protocols. Hardware support for VI (Virtual Interface) makes it particularly effective.

DAFS was designed for cluster and server environments, for databases, and for various Internet applications oriented toward continuous operation. It offers the lowest latencies for access to shared files and data, and it supports intelligent system- and data-recovery mechanisms, which makes it very attractive for high-end NAS devices.

All roads lead to IP Storage

There have been a lot of exciting new technologies in high- and mid-range storage systems over the past few years.

Fibre Channel storage area networks are already a fairly well-known and popular technology. At the same time, their mass adoption remains problematic for several reasons, chief among them the high cost of implementation and the complexity of building geographically distributed systems. On the one hand, these are simply traits of an enterprise-level technology; on the other, if SANs become cheaper and distributed systems easier to build, storage networking should take a huge leap forward.

As part of the work on network storage technologies, an IP Storage (IPS) working group and forum were created within the Internet Engineering Task Force (IETF), covering the following areas:

FCIP (Fibre Channel over TCP/IP) is a TCP/IP-based tunneling protocol whose function is to connect geographically distant FC SANs without any impact on the FC and IP protocols themselves.

iFCP (Internet Fibre Channel Protocol) is a TCP/IP-based protocol for connecting FC storage systems or FC storage networks over an IP infrastructure, together with or instead of FC switching and routing elements.

iSNS (Internet Storage Name Service) is a protocol for maintaining the names of storage resources on the Internet.

iSCSI (Internet Small Computer Systems Interface) is a TCP/IP-based protocol designed to establish communication with, and manage, storage systems, servers, and clients (per the SNIA IP Storage Forum definition).

The most rapidly developing and most interesting of these areas is iSCSI.

iSCSI - the new standard

On February 11, 2003, iSCSI became an official standard. Ratification will certainly broaden interest in a standard that is already developing quite actively. iSCSI is most likely to drive the spread of SANs in small and medium-sized businesses, since standards-compliant equipment and a service approach (including approaches common in ordinary Ethernet networks) will make storage area networks much cheaper. As for using iSCSI across the Internet, FCIP has already taken root there, and competing with it will be difficult.

The new standard was readily supported by well-known IT companies. There are opponents, of course, but almost every company active in the entry-level and mid-range market is already working on iSCSI-enabled devices. iSCSI drivers are already included in Windows and Linux, iSCSI storage systems are manufactured by IBM, and adapters by Intel; HP, Dell, and EMC promise to join in mastering the new standard in the near future.

One very interesting feature of iSCSI is that it can run not only over the media, switches, and routers of existing LAN/WAN networks, but also over ordinary Fast Ethernet or Gigabit Ethernet network adapters on the client side. However, such adapters impose a significant overhead on the host CPU: according to the developers, a software iSCSI implementation can reach the speed of a Gigabit Ethernet link while loading a modern CPU heavily, up to 100%. For this reason, special network cards that offload TCP stack processing from the CPU are recommended.
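To illustrate the underlying idea, that block-level reads can be carried over ordinary TCP/Ethernet, here is a deliberately simplified Python sketch. It is emphatically not the real iSCSI PDU format or login procedure; the disk-image name is a placeholder, and only the port number (3260) is borrowed from the actual standard:

    # Toy block server: the client asks for (LBA, block count); the server
    # returns raw bytes from a disk image. NOT real iSCSI, just the idea.
    import socket
    import struct

    BLOCK = 512
    IMAGE = "disk.img"   # hypothetical disk image standing in for a LUN
    PORT = 3260          # the port real iSCSI uses

    def serve():
        with socket.create_server(("0.0.0.0", PORT)) as srv:
            conn, _ = srv.accept()
            with conn, open(IMAGE, "rb") as disk:
                while True:
                    hdr = conn.recv(12)       # toy: assume the 12-byte
                    if len(hdr) < 12:         # header arrives whole
                        break
                    lba, count = struct.unpack("!QI", hdr)
                    disk.seek(lba * BLOCK)
                    conn.sendall(disk.read(count * BLOCK))

    def read_blocks(host, lba, count):
        with socket.create_connection((host, PORT)) as c:
            c.sendall(struct.pack("!QI", lba, count))
            data = b""
            while len(data) < count * BLOCK:
                chunk = c.recv(65536)
                if not chunk:
                    break
                data += chunk
            return data

A real initiator and target add sessions, authentication, error recovery, and the full SCSI command set on top of exactly this kind of TCP connection.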

Virtualization in SANs

Another important technology in building modern storage devices and storage networks is virtualization.

Storage virtualization is the presentation of physical resources in a logical, more convenient form. This technology allows resources to be distributed flexibly among users and managed efficiently. Within a virtualization framework one can implement remote copying, snapshots, routing of I/O requests to the drives best suited to the kind of service required, and many other algorithms. Virtualization algorithms can be implemented by the storage device itself, by external virtualization devices, or by control servers running specialized software on standard operating systems.
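A minimal sketch of the mapping idea at the heart of virtualization: a logical volume is just a table of extents that translates logical block addresses to (physical drive, offset) pairs, and the virtualization layer routes each I/O request through that table. All names here are illustrative, not any vendor's API:

    # A logical volume as an extent map across physical drives.
    from dataclasses import dataclass

    @dataclass
    class Extent:
        logical_start: int   # first logical block of this extent
        length: int          # number of blocks in the extent
        drive: str           # physical drive holding the extent
        physical_start: int  # first physical block on that drive

    class LogicalVolume:
        def __init__(self, extents):
            self.extents = sorted(extents, key=lambda e: e.logical_start)

        def resolve(self, lba):
            """Translate a logical block address to (drive, physical LBA)."""
            for e in self.extents:
                if e.logical_start <= lba < e.logical_start + e.length:
                    return e.drive, e.physical_start + (lba - e.logical_start)
            raise ValueError(f"LBA {lba} is outside the volume")

    # A 2000-block volume spread across two physical drives.
    vol = LogicalVolume([
        Extent(0, 1000, "drive-A", 5000),
        Extent(1000, 1000, "drive-B", 0),
    ])
    print(vol.resolve(1500))   # -> ('drive-B', 500)

Remote copying, snapshots, and tiering are, in essence, manipulations of this map performed without the clients' knowledge.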

This is, of course, only a small part of what can be said about virtualization. The topic is interesting and extensive enough that we have decided to devote a separate publication to it.
