Synthesis and recognition of speech. Modern solutions. Computer sound equipment. Converting sound into a stream of numbers. Audio compression: principles and settings. Factors that limit the dynamic range

DRC (Dynamic Range Compression)

An encoding technology used in DVD players that have their own audio decoders, and in receivers. Dynamic range compression (reduction) is used to limit audio peaks when watching movies. If the viewer wants to watch a film with abrupt changes in volume (a war film, for example) without disturbing the rest of the family, DRC should be turned on. Subjectively, with DRC on, the proportion of low frequencies decreases and high sounds lose transparency, so the DRC mode should not be enabled unless necessary.

DreamWeaver

A visual editor of hypertext documents developed by the software company Macromedia Inc. The powerful professional program DreamWeaver can generate HTML pages of any complexity and scale, and has built-in tools for supporting large network projects. It is a visual design tool that supports advanced WYSIWYG technology.

Driver

A software component that allows the computer to interact with devices such as a network card (NIC), keyboard, printer or monitor. Network equipment (such as a hub) connected to a PC also requires a driver for the PC to communicate with it.

DRM (Digital Rights Management - management of access to and copying of copyright-protected information)

1. A concept involving the use of special technologies and methods of protecting digital materials, to ensure that they are provided only to authorized users.

2. A client program for interacting with the Digital Rights Management Services package, which is designed to control access to and copying of copyrighted information. DRM Services runs under Windows Server 2003. The client software runs on Windows 98, Me, 2000 and XP, allowing applications such as Office 2003 to access the corresponding services. Microsoft is also expected to release a digital rights management module for the Internet Explorer browser. Eventually, such a program is planned on the computer for working with any content that uses DRM technologies to protect against illegal copying.

Droid (Robot) (See Agent)

DSA (Digital Signature Algorithm)

A public-key digital signature algorithm, developed by NIST (USA) in 1991.

DSL (Digital Subscriber Line)

A modern technology, supported by city telephone exchanges, for exchanging signals at higher frequencies than those used by conventional analog modems. A DSL modem can work simultaneously with a telephone (analog signal) and a digital line. Since the spectrum of the voice signal from the phone and that of the digital DSL signal do not "intersect", i.e. do not affect each other, DSL allows you to surf the Internet and talk on the phone over the same physical line. What's more, DSL technology typically uses multiple frequencies, and the DSL modems on both sides of the line try to pick the best ones for data transmission. A DSL modem not only transmits data but can also act as a router: equipped with an Ethernet port, it makes it possible to connect several computers.

DSOM (Distributed System Object Model, Distributed SOM)

An IBM technology with corresponding software support.

DSR¹ (Data Set Ready - data-ready signal, DSR signal)

A serial-interface signal indicating that the device (for example, a modem) is ready to send a bit of data to the PC.

DSR² (Device Status Report)

DSR³ (Device Status Register)

DSS (Decision Support System) (See …)

This group of methods is based on subjecting the transmitted signals to nonlinear amplitude transformations, with the nonlinearities of the transmitting and receiving parts mutually inverse. For example, if the transmitter uses the nonlinear function √u, the receiver uses u². Successive application of mutually inverse functions keeps the overall transformation linear.

The idea of nonlinear data compression methods is that, for the same amplitude of the output signals, the transmitter can convey a larger range of change of the transmitted parameter (that is, a larger dynamic range). The dynamic range is the ratio of the largest admissible signal amplitude to the smallest, expressed in relative units or in decibels:

D = Umax / Umin ; (2.17)

D [dB] = 20 lg (Umax / Umin) . (2.18)

The natural desire to increase the dynamic range by reducing Umin is limited by the sensitivity of the equipment and by the growing influence of interference and intrinsic noise.

Most often, dynamic range compression is performed using a pair of mutually inverse functions, the logarithm and the exponential. The first amplitude-changing operation is called compression, the second expansion (stretching). These functions are chosen for their large compression capability.

At the same time, these methods also have disadvantages. The first is that the logarithm of a small number is negative, and in the limit

lim (u→0) log u = −∞,

that is, the sensitivity is highly nonlinear.

To reduce these shortcomings, both functions are modified by offset and approximation. For example, for telephone channels the approximated function has the form (A-law):

F(x) = A·|x| / (1 + ln A) for 0 ≤ |x| < 1/A;
F(x) = (1 + ln(A·|x|)) / (1 + ln A) for 1/A ≤ |x| ≤ 1,

where A = 87.6. The gain from compression in this case is 24 dB.
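To make the companding idea concrete, here is a minimal Python sketch of A-law compression and expansion, written straight from the formula above; the function names and the use of samples normalized to [-1, 1] are our own assumptions, not part of any telephony library.

    import math

    A = 87.6  # A-law parameter used in European telephony

    def alaw_compress(x: float) -> float:
        """A-law compression of a sample x normalized to [-1, 1]."""
        sign = 1.0 if x >= 0 else -1.0
        x = abs(x)
        if x < 1.0 / A:
            y = A * x / (1.0 + math.log(A))
        else:
            y = (1.0 + math.log(A * x)) / (1.0 + math.log(A))
        return sign * y

    def alaw_expand(y: float) -> float:
        """The mutually inverse (expanding) transformation."""
        sign = 1.0 if y >= 0 else -1.0
        y = abs(y)
        if y < 1.0 / (1.0 + math.log(A)):
            x = y * (1.0 + math.log(A)) / A
        else:
            x = math.exp(y * (1.0 + math.log(A)) - 1.0) / A
        return sign * x

    # Round trip: quiet samples get a much larger share of the output range.
    for x in (0.001, 0.01, 0.1, 1.0):
        y = alaw_compress(x)
        print(f"x = {x:6.3f}  compressed = {y:6.3f}  restored = {alaw_expand(y):8.5f}")

Successive application of the two functions returns the original sample, as required of mutually inverse nonlinearities.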

Data compression by nonlinear procedures is implemented by analog means only with large errors. Digital tools can significantly improve the accuracy or speed of the conversion, although the direct use of computation (that is, straightforward calculation of logarithms and exponentials) would not give the best result, because of low speed and accumulating calculation errors.

Because of these accuracy limitations, data compression by companding is used in non-critical cases, for example for voice transmission over telephone and radio channels.

Efficient coding

Efficient codes were proposed by Shannon, Fano and Huffman. Their essence is that they are non-uniform, i.e. have an unequal number of digits, with the code length inversely related to the probability of the symbol's occurrence. Another great feature of efficient codes is that they do not require delimiters, i.e. special characters separating adjacent codewords. This is achieved by observing a simple rule: shorter codes must not be the beginning of longer ones. The continuous bit stream is then decoded unambiguously, because the decoder detects the shorter patterns first. Efficient codes long remained purely academic but have recently been used successfully in databases, as well as for compressing information in modern modems and software archivers.

Because the codes are non-uniform, the concept of average code length is introduced. The average length is the mathematical expectation of the code length:

l_av = Σ p_i · l_i , i = 1 … N,

and l_av tends to H(x) from above, that is,

l_av > H(x). (2.23)

The fulfillment of condition (2.23) becomes tighter as N increases.

There are two types of efficient codes: Shannon-Fano and Huffman. Let us derive them with an example. Suppose the probabilities of the characters in the sequence have the values given in Table 2.1.

Table 2.1. Symbol probabilities

N    1     2     3     4     5     6     7     8     9
pi   0.1   0.2   0.1   0.3   0.05  0.15  0.03  0.02  0.05

The symbols are ranked, i.e. arranged in a series in descending order of probability. After that, the following Shannon-Fano procedure is repeated: the entire group of events is divided into two subgroups with equal (or approximately equal) total probabilities. The procedure continues until one element remains in the current subgroup; that element is then set aside, and the same actions continue with the remaining ones, until only one element is left in each of the last two subgroups. Let us continue with our example, which is summarized in Table 2.2.

Table 2.2. Shannon-Fano coding

N    pi      partition groups        code
4    0.3     I  I                    11
2    0.2     I  II                   10
6    0.15    II I  I                 011
3    0.1     II I  II                010
1    0.1     II II I                 001
9    0.05    II II II I              0001
5    0.05    II II II II I           00001
7    0.03    II II II II II I        000001
8    0.02    II II II II II II       000000

As can be seen from Table 2.2, the first symbol, with probability p4 = 0.3, participated in two partitioning procedures and both times fell into group I. Accordingly, it is encoded with the two-digit code 11. The second element at the first stage of partitioning belonged to group I, at the second to group II; its code is therefore 10. The codes of the remaining characters need no additional comment.

Non-uniform codes are usually depicted as code trees. A code tree is a graph indicating the allowed code combinations. The directions of the edges of the graph are fixed in advance, as shown in Fig. 2.11 (the choice of directions is arbitrary).

The graph is used as follows: a route is traced to the selected symbol; the number of bits in its code equals the number of edges in the route, and the value of each bit equals the direction of the corresponding edge. The route is drawn from the starting point (marked with the letter A in the drawing). For example, the route to vertex 5 consists of five edges, all but the last of which have direction 0; we get the code 00001.

For this example, let us calculate the entropy and the average word length.

H(x) = −(0.3 log 0.3 + 0.2 log 0.2 + 0.15 log 0.15 + 2·0.1 log 0.1 + 2·0.05 log 0.05 + 0.03 log 0.03 + 0.02 log 0.02) ≈ 2.76 bits;

l_av = 0.3·2 + 0.2·2 + 0.15·3 + 0.1·3 + 0.1·3 + 0.05·4 + 0.05·5 + 0.03·6 + 0.02·6 = 2.8.

As you can see, the average word length is close to the entropy.
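The figures are easy to check with a few lines of Python (a sketch that assumes the code lengths read off Table 2.2):

    import math

    p = [0.3, 0.2, 0.15, 0.1, 0.1, 0.05, 0.05, 0.03, 0.02]  # ranked probabilities (Table 2.1)
    lengths = [2, 2, 3, 3, 3, 4, 5, 6, 6]                   # Shannon-Fano code lengths (Table 2.2)

    H = -sum(pi * math.log2(pi) for pi in p)           # entropy, bits
    l_av = sum(pi * li for pi, li in zip(p, lengths))  # average code length
    print(f"H(x) = {H:.2f} bits, l_av = {l_av:.2f} bits")  # ~2.76 and 2.80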

Huffman codes are built by a different algorithm. The encoding procedure consists of two stages. At the first stage, the alphabet is compressed one step at a time: the two characters with the lowest probabilities are replaced by a single character with their total probability. Compression continues until two characters remain. Along the way a coding table is filled in, recording the resulting probabilities and the routes by which the new symbols move at the next stage.

At the second stage the actual encoding takes place, starting from the last stage: the first of the two remaining characters is assigned the code 1, the second the code 0. Moving back to the previous stage, characters that did not participate in compression at that stage keep their codes from the following stage, while the two characters that were merged both receive the code of the merged character, with 1 appended for the upper character and 0 for the lower one. If a character takes no further part in merging, its code remains unchanged. The procedure continues back to the first stage.

Table 2.3 shows Huffman encoding. As can be seen from the table, encoding was carried out in 7 stages. On the left are the probabilities of the symbols, on the right the intermediate codes, so the movement of the newly formed symbols can be traced from stage to stage. At each stage the last two characters differ only in the least significant bit, which corresponds to the coding technique. Let us calculate the average word length:

l_av = 0.3·2 + 0.2·2 + 0.15·3 + 2·0.1·3 + 0.05·4 + 0.05·5 + 0.03·6 + 0.02·6 = 2.8.

This is just as close to the entropy: the code is equally efficient. Fig. 2.12 shows the Huffman code tree.

Table 2.3. Huffman encoding

N  pi    code     I           II          III         IV          V           VI         VII
4  0.3   11       0.3  11     0.3  11     0.3  11     0.3  11     0.3  11     0.4  0     0.6  1
2  0.2   01       0.2  01     0.2  01     0.2  01     0.2  01     0.3  10     0.3  11    0.4  0
6  0.15  101      0.15 101    0.15 101    0.15 101    0.2  00     0.2  01     0.3  10
3  0.1   001      0.1  001    0.1  001    0.15 100    0.15 101    0.2  00
1  0.1   000      0.1  000    0.1  000    0.1  001    0.15 100
9  0.05  1000     0.05 1000   0.1  1001   0.1  000
5  0.05  10011    0.05 10011  0.05 1000
7  0.03  100101   0.05 10010
8  0.02  100100

Both codes satisfy the requirement of unambiguous decoding: as can be seen from the tables, shorter combinations are not the beginning of longer codes.
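The merging procedure itself is compact enough to sketch in Python with a priority queue; when probabilities tie, the exact codewords may differ from Table 2.3, but the code lengths, and hence the average length, come out the same.

    import heapq

    def huffman_codes(probabilities: dict) -> dict:
        """Build a Huffman code table from {symbol: probability}."""
        # heap items: (probability, tie-breaker, {symbol: code-so-far})
        heap = [(pr, i, {sym: ""}) for i, (sym, pr) in enumerate(probabilities.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)  # the two least probable entries
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            heapq.heappush(heap, (p1 + p2, counter, merged))
            counter += 1
        return heap[0][2]

    probs = {4: 0.3, 2: 0.2, 6: 0.15, 3: 0.1, 1: 0.1, 9: 0.05, 5: 0.05, 7: 0.03, 8: 0.02}
    codes = huffman_codes(probs)
    l_av = sum(probs[s] * len(c) for s, c in codes.items())
    print(codes)
    print(f"average length = {l_av:.2f} bits")  # 2.80 for this alphabet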

As the number of characters grows, so does the efficiency of the codes; therefore, in some cases larger blocks are encoded (for texts, for example, one can encode the most common syllables, words and even phrases).

The effect of introducing such codes is determined by comparison with a uniform code:

K = n / l_av , (2.24)

where n is the number of digits of the uniform code that is replaced by the effective one. In our example a uniform code would require n = 4 (since 2^4 = 16 ≥ 9), so K = 4 / 2.8 ≈ 1.4.

Modifications of Huffman codes

The classical Huffman algorithm is a two-pass one: first, statistics on the symbols in the messages are collected; then the procedures described above are carried out. This is inconvenient in practice, since it increases the time of message processing and dictionary accumulation. One-pass methods, in which the accumulation and encoding procedures are combined, are more commonly used. Such methods are also called adaptive Huffman compression [46].

The essence of adaptive Huffman compression is the construction of an initial code tree and its subsequent modification after the arrival of each next character. As before, the trees here are binary, i.e. at most two arcs leave each vertex of the tree graph. It is customary to call the initial vertex the parent, and the two vertices connected to it its children. Let us introduce the concept of the weight of a vertex: the number of characters (words) corresponding to the given vertex that have been encountered while the original sequence is being fed in. Obviously, the sum of the children's weights equals the weight of the parent.

After the next symbol of the input sequence arrives, the code tree is revised: the weights of the vertices are recalculated and, if necessary, the vertices are rearranged. The rearrangement rule is as follows: the weights of the lower vertices are the smallest, and among vertices of one level those on the left of the graph have the smallest weights.

At the same time, the vertices are numbered. Numbering starts from the lower (hanging, i.e. childless) vertices, from left to right, then moves to the upper level, and so on, up to the number of the last, initial vertex. This achieves the following: the smaller the weight of a vertex, the smaller its number.

Rearrangement is carried out mainly for hanging vertices. It must take into account the rule formulated above: vertices with a larger weight must also have a larger number.

After the sequence (also called the control or test sequence) has been passed, code combinations are assigned to all hanging vertices. The assignment rule is similar to the one above: the number of code bits equals the number of vertices the route passes through from the source to the given hanging vertex, and the value of a particular bit corresponds to the direction from the parent to the child (say, moving to the left from the parent corresponds to the value 1, to the right to 0).

The resulting code combinations are entered into the memory of the compression device together with their counterparts and form a dictionary. The algorithm is used as follows: the sequence of characters to be compressed is divided into fragments according to the available dictionary, and each fragment is replaced by its code from the dictionary. Fragments not found in the dictionary form new hanging vertices, acquire weight and are also entered into the dictionary. In this way an adaptive dictionary-replenishment algorithm is formed.

To increase the efficiency of the method, it is desirable to increase the size of the dictionary: the compression ratio then rises. In practice, dictionaries of 4-16 KB are used.


Let us illustrate the algorithm with an example. Fig. 2.13 shows the original diagram (also called a Huffman tree). Each vertex of the tree is shown as a rectangle containing two numbers separated by a slash: the first is the number of the vertex, the second is its weight. As you can see, the correspondence between the weights of the vertices and their numbers holds.

Now suppose that the symbol corresponding to vertex 1 occurs a second time in the test sequence. The weight of the vertex changes, as shown in Fig. 2.14, so that the vertex numbering rule is violated. At the next stage we rearrange the hanging vertices, swapping vertices 1 and 4 and renumbering all the vertices of the tree. The resulting graph is shown in Fig. 2.15. The procedure then continues in the same way.

It should be remembered that each hanging vertex of the Huffman tree corresponds to a certain character or group of characters. The parent differs from its children in that its group of characters is one character shorter than theirs, and the children differ from each other in the last character. For example, if the parent corresponds to the characters "kar", the children may hold the sequences "kara" and "karp".

The algorithm described above is not merely academic: it is actively used in archiving programs, including for the compression of graphic data (discussed below).

Lempel-Ziv algorithms

These are the most commonly used compression algorithms today. They are employed in most archiver programs (for example PKZIP, ARJ, LHA). The essence of the algorithms is that a certain set of characters is replaced during archiving by its number in a specially formed dictionary. For example, the phrase "Outgoing number for your letter ...", often found in business correspondence, might occupy position 121 in the dictionary; then, instead of transmitting or storing the phrase (30 bytes), one can store its number (1.5 bytes in BCD or 1 byte in binary).

The algorithms are named after the authors who first proposed them in 1977; the first of them is LZ77. For archiving, a sliding window over the message is created, consisting of two parts. The first part, of larger size, serves to form the dictionary and occupies on the order of several kilobytes. The second, smaller part (usually up to 100 bytes) receives the current characters of the text being viewed. The algorithm tries to find in the dictionary a set of characters matching those in the viewing window. If it succeeds, a code consisting of three parts is formed: the offset of the matching substring in the dictionary, the length of that substring, and the character that follows it. For example, suppose the matched substring is 6 characters long and the character following it is "e"; if the substring is located at address 45 (its place in the dictionary), the output entry looks like "45, 6, e". After that, the contents of the window are shifted by one position and the search continues. This is how the dictionary is formed.
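A toy LZ77 compressor in Python makes the "offset, length, next character" scheme concrete; the window and look-ahead sizes are illustrative, and the brute-force match search is deliberately simple.

    def lz77_compress(data: str, window: int = 4096, lookahead: int = 100):
        """Toy LZ77: emit (offset, length, next_char) triples."""
        i, out = 0, []
        while i < len(data):
            start = max(0, i - window)
            best_len, best_off = 0, 0
            # search the dictionary part of the window for the longest match
            for j in range(start, i):
                length = 0
                while (length < lookahead and i + length < len(data) - 1
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, i - j
            nxt = data[i + best_len]          # the character following the match
            out.append((best_off, best_len, nxt))
            i += best_len + 1
        return out

    print(lz77_compress("abracadabra"))
    # [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'r'), (3, 1, 'c'), (5, 1, 'd'), (7, 3, 'a')]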

The advantage of the algorithm is an easily formalized dictionary-building procedure. In addition, decompression is possible without the initial dictionary (having a test sequence is desirable in that case): the dictionary is formed in the course of decompression.

The disadvantages of the algorithm appear as the dictionary grows: the search time increases. In addition, if a string of characters that is not in the dictionary appears in the current window, each character is written with a three-element code; this is not compression but expansion.

The LZSS algorithm, proposed in 1982, performs better. It differs in how the sliding window is maintained and in the compressor's output codes. In addition to the window, the algorithm builds a binary tree, similar to a Huffman tree, to speed up the search for matches: each substring that leaves the current window is added to the tree as one of the children. The algorithm makes it possible to increase the size of the current window further (its value should preferably be a power of two: 128, 256, etc. bytes). Sequence codes are also formed differently: an additional 1-bit prefix distinguishes unencoded characters from "offset, length" pairs.

An even greater degree of compression is achieved by algorithms of the LZW type. The algorithms described earlier have a fixed window size, which makes it impossible to enter phrases longer than the window into the dictionary. In the LZW algorithms (and their predecessor LZ78), the viewing window is of unlimited size, and the dictionary accumulates phrases (rather than collections of characters, as before). The dictionary has unlimited length, and the encoder (decoder) works in phrase-waiting mode. When a phrase matching the dictionary has been formed, the match code (i.e. the code of that phrase in the dictionary) is output together with the code of the character following it. If, as characters accumulate, a new phrase is formed, it is entered into the dictionary, like the shorter one before it. The result is a recursive procedure that provides fast encoding and decoding.
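The phrase-accumulating dictionary is easy to demonstrate with a toy LZW coder in Python (a sketch: real implementations also limit the code width and handle dictionary overflow).

    def lzw_compress(data: str) -> list:
        """Toy LZW: the dictionary grows with every new phrase encountered."""
        dictionary = {chr(i): i for i in range(256)}  # start from single bytes
        phrase, out = "", []
        for ch in data:
            if phrase + ch in dictionary:
                phrase += ch                  # keep extending the current phrase
            else:
                out.append(dictionary[phrase])
                dictionary[phrase + ch] = len(dictionary)  # remember the new phrase
                phrase = ch
        if phrase:
            out.append(dictionary[phrase])
        return out

    print(lzw_compress("abababab"))  # [97, 98, 256, 258, 98]: repeats collapse to single codes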

An additional compression capability is provided by compact coding of repeated characters. If certain characters follow one another in a run (in text these may be space characters, in a numerical sequence consecutive zeros, and so on), it makes sense to replace the run with the pair "character, length" or "flag, length". In the first case, the code contains a flag indicating that a run is being encoded (usually 1 bit), then the code of the repeated character and the length of the run. In the second case (provided for the most frequently repeated characters), the prefix contains only the repetition flag.
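A sketch of the first variant in Python, with an assumed marker character and an assumed minimum run of 4 (a real format would fix its own conventions):

    def rle_encode(data: str, marker: str = "#") -> str:
        """Toy run-length coding: a run becomes marker + character + decimal length."""
        out, i = [], 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i]:
                run += 1
            if run >= 4:                 # short runs are cheaper left as they are
                out.append(f"{marker}{data[i]}{run}")
            else:
                out.append(data[i] * run)
            i += run
        return "".join(out)

    print(rle_encode("ab    cd000000"))  # -> "ab# 4cd#06"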

People fascinated by home audio exhibit an interesting paradox. They are ready to rework the listening room and build speakers with exotic drivers, yet they balk at the canned music itself, like a wolf before a red flag. But really, why not step over the flag and try to cook something more edible out of the canned goods?

From time to time, plaintive questions appear on forums: "Recommend some well-recorded albums." Understandable. Special audiophile editions may please the ear for the first minute, but no one listens to them to the end: the repertoire is painfully dull. As for the rest of the music library, the problem seems obvious. You can save money, or you can sink a fortune into components; either way, few people enjoy listening to their favorite music at high volume, and the capabilities of the amplifier have nothing to do with it.

Today, even on Hi-Res albums, the peaks of the recording are cut off and the volume is driven into clipping. The assumption is that the majority listens to music on any old junk, so the level has to be cranked up as a kind of loudness compensation.


Of course, this is not done deliberately to upset audiophiles; hardly anyone remembers them at all. Someone merely had the idea of giving them the master files from which the main circulation is copied: CDs, MP3s, and so on. Naturally, the master has long since been flattened by the compressor; no one deliberately prepares special versions for HD Tracks. The exception is the separate procedure sometimes followed for the vinyl release, which for this reason sounds more humane. For the digital path everything ends the same way: with a big fat compressor.

So at present virtually 100% of released recordings, with the exception of classical music, are compressed during mastering. Some perform this procedure more or less skillfully, others quite crudely. As a result we have pilgrims on the forums with the DR-meter plugin at the ready, painful comparisons of editions, and flight to vinyl, where you also have to hunt down first pressings.

At the sight of all these outrages, the most hardened listeners have literally turned into audio Satanists. No kidding: they are reading the sound engineer's holy scripture backwards! Modern sound-editing programs include tools for restoring a clipped sound wave.

Initially this functionality was intended for studios. When mixing, there are situations where clipping has gotten onto the recording and, for one reason or another, the session can no longer be redone; here the audio editor's arsenal comes to the rescue: declipper, decompressor, and so on.

And now ordinary listeners, whose ears bleed after another new release, are reaching for such software ever more boldly. Some prefer iZotope, some Adobe Audition, some split the operations between several programs. The point of restoring the former dynamics is to programmatically repair the clipped signal peaks which, pressed flat against 0 dB, resemble gear teeth.

Of course, there is no question of a 100% revival of the source, since interpolation with rather speculative algorithms is involved. Still, some results of such processing seemed interesting and worthy of study to me.

Take, for example, Lana Del Rey's album "Lust For Life", with its consistently foul (ugh!) mastering. The original of the song "When the World Was at War We Kept Dancing" looked like this.


And after a series of declippers and decompressors it became like this. The DR value changed from 5 to 9. You can download and listen to the sample before and after processing.


I cannot say that the method is universal and suitable for every ruined album, but in this case I preferred to keep in my collection this particular version, processed by a rutracker activist, instead of the official 24-bit edition.

Even if artificially pulling the peaks back out of the sausage does not restore the true dynamics of the musical performance, your DAC will still thank you. It was hard for it to work without errors at the limiting levels, where the likelihood of so-called inter-sample peaks (ISP) is high; now only rare flashes of the signal will jump to 0 dB. In addition, a quieter track compressed to FLAC or another lossless codec will be smaller in size: more "air" in the signal saves hard drive space.

Try to revive your most hated albums killed in the loudness war. To create headroom, first lower the track level by 6 dB, then run the declipper. Those who do not believe in computers can simply insert a studio expander between the CD player and the amplifier. This device does essentially the same thing: it restores and stretches the peaks of a compressed audio signal as far as possible. Such devices from the 80s-90s are not very expensive, and it would be very interesting to try one as an experiment.
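For the computer-inclined, the crudest possible declipper fits in a few lines of Python; unlike the interpolators in iZotope or Adobe Audition it merely draws straight lines across the clipped runs, so treat it strictly as an illustration of the idea.

    import numpy as np

    def naive_declip(x: np.ndarray, ceiling: float = 0.999) -> np.ndarray:
        """Replace runs of clipped samples with interpolation from intact neighbors."""
        y = x.astype(float).copy()
        clipped = np.abs(y) >= ceiling
        idx = np.arange(len(y))
        # rebuild the flattened peaks from the surrounding intact samples
        y[clipped] = np.interp(idx[clipped], idx[~clipped], y[~clipped])
        return y

    # restored = naive_declip(samples) * 10 ** (-6 / 20)  # then ~6 dB down for headroom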


The DBX 3BX dynamic range controller processes the signal separately in three bands - bass, midrange and treble

Once upon a time, equalizers were a matter of course in an audio system, and no one was afraid of them. Today no one needs to equalize away a magnetic tape's high-frequency roll-off, but something has to be done about the ugly dynamics, brothers.

At the time when researchers were just beginning to tackle the problem of a speech interface for computers, they often had to build their own equipment for getting sound information into and out of a computer. Today such devices are mainly of historical interest, since modern computers are easily equipped with sound input and output devices such as sound adapters, microphones, headphones and speakers.

We will not go into the details of the internal structure of these devices, but we will talk about how they work, and give some recommendations for choosing sound computer devices for working with speech recognition and synthesis systems.

As we said in the previous chapter, sound is nothing more than air vibrations, the frequency of which lies in the frequency range perceived by a person. The exact limits of the range of audible frequencies may vary from person to person, but it is believed that sound vibrations lie in the range of 16-20,000 Hz.

The task of the microphone is to convert sound vibrations into electrical vibrations, which can then be amplified, filtered to remove interference and digitized to enter sound information into a computer.

By principle of operation, the most common microphones are divided into carbon, electrodynamic, condenser and electret types. Some of them require an external current source (for example, carbon and condenser microphones); others (electrodynamic and electret microphones) generate an alternating electrical voltage themselves under the influence of sound vibrations.

You can also separate microphones by purpose. There are studio microphones that can be held in the hand or mounted on a stand, there are radio microphones that can be clipped to clothing, and so on.

There are also microphones designed specifically for computers. These are usually mounted on a stand placed on the table surface. Computer microphones can also be combined with headphones, as shown in Fig. 2-1.

Fig. 2-1. Headphones with microphone

How to choose from the whole variety of microphones the one that is best suited for speech recognition systems?

In principle, you can experiment with any microphone you have, as long as it can be connected to your computer's audio adapter. However, developers of speech recognition systems recommend purchasing a microphone that will be at a constant distance from the speaker's mouth during operation.

If the distance between the microphone and the mouth does not change, then the average level of the electrical signal coming from the microphone will also not change too much. This will have a positive impact on the quality of modern speech recognition systems.

What is the problem here?

A person is able to successfully recognize speech, the volume of which varies over a very wide range. The human brain is able to filter out quiet speech from noise such as the noise of cars driving down the street, extraneous conversations and music.

As for modern speech recognition systems, their abilities in this area leave much to be desired. If the microphone is on a table, then when you turn your head or change the position of your body, the distance between your mouth and the microphone will change. This will change the microphone output level, which in turn will degrade the reliability of speech recognition.

Therefore, when working with speech recognition systems, the best results will be achieved if you use a microphone attached to headphones, as shown in Fig. 2-1. When using such a microphone, the distance between the mouth and the microphone will be constant.

We also draw your attention to the fact that all experiments with speech recognition systems are best done in seclusion in a quiet room. In this case, the influence of interference will be minimal. Of course, if you need to choose a speech recognition system that can work in conditions of strong interference, then the tests need to be done differently. However, as far as the authors of the book know, the noise immunity of speech recognition systems is still very, very low.

The microphone performs for us the conversion of sound vibrations into electrical current vibrations. These fluctuations can be seen on the oscilloscope screen, but do not rush to the store to purchase this expensive device. We can carry out all oscillographic studies using a conventional computer equipped with a sound adapter, for example, a Sound Blaster adapter. Later we will tell you how to do it.

In Fig. 2-2 we show the oscillogram of the sound signal obtained while pronouncing a drawn-out sound "a". This waveform was acquired with the GoldWave program, which we will cover later in this chapter, using a Sound Blaster audio adapter and a microphone similar to the one shown in Fig. 2-1.

Fig. 2-2. Oscillogram of the audio signal

The GoldWave program can stretch the waveform along the time axis, making it possible to see the smallest details. In Fig. 2-3 we show a stretched fragment of the oscillogram of the sound "a" mentioned above.

Fig. 2-3. Fragment of the oscillogram of the audio signal

Note that the magnitude of the input signal from the microphone changes periodically and takes on both positive and negative values.

If only one frequency were present in the input signal (that is, if the sound were "clean"), the waveform received from the microphone would be sinusoidal. However, as we have already said, the spectrum of human speech sounds consists of a set of frequencies, as a result of which the shape of the speech signal oscillogram is far from sinusoidal.

A signal whose magnitude changes continuously in time we will call an analog signal. This is the signal coming from the microphone. A digital signal, in contrast, is a set of numerical values that change discretely in time.

For a computer to process an audio signal, the signal must be converted from analog to digital form, that is, represented as a set of numerical values. This process is called digitization.

Digitization of an audio (or any other analog) signal is performed by a special device, an analog-to-digital converter (ADC). This device sits on the sound adapter board and looks like an ordinary microcircuit.

How does an analog-to-digital converter work?

It periodically measures the level of the input signal and outputs the numerical value of the measurement result. This process is illustrated in Fig. 2-4, where the gray rectangles mark the values of the input signal measured at a fixed, constant time interval. The set of such values is the digitized representation of the input analog signal.

Fig. 2-4. Measurements of signal amplitude versus time

In Fig. 2-5 we show an analog-to-digital converter connected to a microphone. The analog signal is applied to the input x1, and the digital signal is taken from the outputs u1-un.

Fig. 2-5. Analog-to-digital converter

Analog-to-digital converters are characterized by two important parameters - the conversion frequency and the number of quantization levels of the input signal. Proper selection of these parameters is critical to achieving an adequate digitization of an analog signal.

How often do you need to measure the amplitude value of the input analog signal so that information about changes in the input analog signal is not lost as a result of digitization?

It would seem that the answer is simple - the input signal should be measured as often as possible. Indeed, the more often an analog-to-digital converter makes such measurements, the better it will track the slightest changes in the amplitude of the analog input signal.

However, excessively frequent measurements can lead to an unjustified increase in the digital data flow and a waste of computer resources in signal processing.

Fortunately, choosing the right conversion (sampling) rate is easy enough. It suffices to refer to the Kotelnikov theorem, well known to specialists in digital signal processing: the sampling frequency must be at least twice the maximum frequency in the spectrum of the signal being converted. Therefore, to digitize an audio signal whose frequencies lie in the range of 16-20,000 Hz without loss of quality, the sampling frequency must be chosen no lower than 40,000 Hz.
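The theorem is easy to see in action with a few lines of Python: a 5000 Hz tone sampled at 8000 Hz (less than the required 10,000 Hz) shows up at the wrong frequency.

    import numpy as np

    fs = 8000        # sampling rate, Hz
    f_tone = 5000    # tone above fs/2 = 4000 Hz, so it will alias

    t = np.arange(0, 0.1, 1.0 / fs)       # 100 ms of samples
    x = np.sin(2 * np.pi * f_tone * t)    # the sampled 5 kHz tone

    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    print(f"spectral peak at {freqs[spectrum.argmax()]:.0f} Hz")  # ~3000 Hz, not 5000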

Note, however, that in professional audio equipment, the conversion frequency is selected several times greater than the specified value. This is done to achieve very high quality digitized audio. For speech recognition systems, this quality is not relevant, so we will not draw your attention to this choice.

And what conversion frequency is needed to digitize the sound of human speech?

Since the sounds of human speech lie in the frequency range of 300-4000 Hz, the minimum required sampling frequency is 8000 Hz. However, many computer speech recognition programs use the standard 44,100 Hz rate of conventional audio adapters. On the one hand, this rate does not lead to an excessive increase in the digital data stream; on the other, it ensures speech digitization of sufficient quality.

Back in school, we were taught that with any measurements, errors arise that cannot be completely eliminated. Such errors arise due to the limited resolution of measuring instruments, and also due to the fact that the measurement process itself can introduce some changes in the measured value.

The analog-to-digital converter represents the input analog signal as a stream of numbers of limited width. Conventional audio adapters contain 16-bit ADCs capable of representing the amplitude of the input signal as 2^16 = 65,536 different values. ADC devices in high-end audio equipment may be 20-bit, providing greater accuracy in representing the amplitude of the audio signal.

Modern speech recognition systems and programs were created for ordinary computers equipped with ordinary sound adapters. Therefore, to conduct experiments with speech recognition, you do not need to purchase a professional audio adapter. An adapter such as Sound Blaster is quite suitable for digitizing speech for further recognition.

Along with the useful signal, various noises usually enter the microphone - noise from the street, wind noise, extraneous conversations, etc. Noise has a negative impact on the quality of speech recognition systems, so it has to be dealt with. One of the ways we have already mentioned is that today's speech recognition systems are best used in a quiet room, remaining alone with the computer.

However, ideal conditions can not always be created, so you have to use special methods to get rid of interference. To reduce the noise level, special tricks are used in the design of microphones and special filters that remove frequencies from the analog signal spectrum that do not carry useful information. In addition, such a technique as compression of the dynamic range of input signal levels is used.

Let's talk about all this in order.

A device that transforms the frequency spectrum of an analog signal is called a frequency filter. In the process of transformation, oscillations of certain frequencies are selected (or absorbed).

You can think of this device as a kind of black box with one input and one output. In relation to our situation, a microphone will be connected to the input of the frequency filter, and an analog-to-digital converter will be connected to the output.

Frequency filters come in several kinds:

low-pass filters;

high-pass filters;

band-pass filters;

band-stop filters.

Low-pass filters (low-pass filter) remove from the spectrum of the input signal all frequencies above a certain cutoff frequency, which depends on the filter setting.

Humans cannot hear sounds at frequencies of 20,000 Hz and above, so these can be cut from the spectrum without noticeable loss of sound quality. As for speech recognition, all frequencies above 4000 Hz can be cut out, which significantly reduces the level of high-frequency interference.

Likewise, high-pass filters (high-pass filter) cut from the spectrum of the input signal all frequencies below a certain cutoff frequency.

Since audio signals lie in the range of 16-20,000 Hz, all frequencies below 16 Hz can be cut off without degrading the sound quality. For speech recognition the important range is 300-4000 Hz, so frequencies below 300 Hz can be cut out as well; all noise whose spectrum lies below 300 Hz is then removed from the input signal and will not interfere with the recognition process.

A band-pass filter (band-pass filter) can be thought of as a combination of a low-pass filter and a high-pass filter. Such a filter stops all frequencies below the so-called lower cutoff frequency and above the upper cutoff frequency.

Thus, for a speech recognition system a band-pass filter is convenient that blocks all frequencies except those in the range of 300-4000 Hz.

As for band-stop filters (band-stop filter), they cut out of the input signal's spectrum all frequencies lying within a given range. Such a filter is convenient, for example, for suppressing noise that occupies a continuous part of the signal spectrum.

In Fig. 2-6 we show the connection of a band-pass filter.

Fig. 2-6. Filtering the audio signal before digitization

Note that the ordinary sound adapters installed in computers already contain a band-pass filter through which the analog signal passes before digitization. Its bandwidth usually corresponds to the range of audio signals, namely 16-20,000 Hz (the exact values of the upper and lower frequencies vary slightly from one audio adapter to another).

But how to achieve a narrower bandwidth of 300-4000 Hz, corresponding to the most informative part of the spectrum of human speech?

Of course, if you have a penchant for designing electronic equipment, you can make your own filter from an operational amplifier chip, resistors and capacitors. This is exactly what the first creators of speech recognition systems did.

However, industrial speech recognition systems must be able to work on standard computer equipment, so the way of manufacturing a special band-pass filter is not suitable here.

Instead, modern speech processing systems use so-called digital frequency filters, implemented in software. This became possible once computer processors became powerful enough.

A digital frequency filter implemented in software converts an input digital signal into an output digital signal. During the conversion, the program processes, in a special way, the stream of numerical amplitude values arriving from the analog-to-digital converter. The result is also a stream of numbers, but one corresponding to the already filtered signal.
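For example, a software band-pass for the 300-4000 Hz speech band can be sketched in Python with SciPy; the filter order and the sampling rate here are arbitrary choices for illustration.

    import numpy as np
    from scipy.signal import butter, sosfilt

    fs = 44100  # sampling rate of the digitized signal, Hz

    # 4th-order Butterworth band-pass passing roughly the speech band
    sos = butter(4, [300, 4000], btype="bandpass", fs=fs, output="sos")

    def filter_speech(samples: np.ndarray) -> np.ndarray:
        """Apply the band-pass filter to a stream of digitized samples."""
        return sosfilt(sos, samples)

    # example: a speech-band tone passes, 50 Hz hum is strongly attenuated
    t = np.arange(0, 1.0, 1.0 / fs)
    noisy = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 50 * t)
    clean = filter_speech(noisy)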

Speaking of the analog-to-digital converter, we noted an important characteristic: the number of quantization levels. If a 16-bit analog-to-digital converter is installed in the audio adapter, the digitized audio signal levels can be represented by 2^16 = 65,536 different values.

If there are few quantization levels, so-called quantization noise appears. To reduce this noise, high-quality audio digitization systems should use analog-to-digital converters with the maximum available number of quantization levels.
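The effect of the number of quantization levels is easy to measure with a short Python sketch that quantizes a test sine wave to a given bit width; the familiar rule of roughly 6 dB of signal-to-noise ratio per bit emerges.

    import numpy as np

    def quantization_snr(bits: int, n: int = 100_000) -> float:
        """Quantize a full-scale sine to 2**bits levels and measure the SNR in dB."""
        t = np.linspace(0, 1, n, endpoint=False)
        x = np.sin(2 * np.pi * 5 * t)          # the "analog" reference
        scale = 2 ** bits / 2 - 1
        xq = np.round(x * scale) / scale       # quantized copy
        noise = x - xq                         # quantization noise
        return 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))

    for b in (8, 16, 20):
        print(f"{b}-bit ADC: SNR ~ {quantization_snr(b):.1f} dB")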

However, there is another trick for reducing the effect of quantization noise on audio quality, used in digital sound recording systems. Before digitization, the signal is passed through a nonlinear amplifier that emphasizes signals of small amplitude: such a device amplifies weak signals more than strong ones.

This is illustrated by the plot of output signal amplitude versus input signal amplitude shown in Fig. 2-7.

Fig. 2-7. Nonlinear amplification before digitization

At the stage of converting the digitized audio back into analog form (which we will discuss later in this chapter), the analog signal is again passed through a nonlinear amplifier before being output to the speakers. This time a different amplifier is used, one that emphasizes signals of large amplitude and whose transfer characteristic (the dependence of output amplitude on input amplitude) is the inverse of the one used during digitization.

How can all this help the creators of speech recognition systems?

A person, as you know, is quite good at recognizing speech uttered in a low whisper or in a fairly loud voice. It can be said that the dynamic range of volume levels of successfully recognized speech for a person is quite wide.

Today's computer speech recognition systems, unfortunately, cannot yet boast of this. However, to expand this dynamic range somewhat, the signal from the microphone can be passed through a nonlinear amplifier with the transfer characteristic shown in Fig. 2-7 before digitization. This reduces the level of quantization noise when digitizing weak signals.

Developers of speech recognition systems, again, have to target mainly mass-produced sound adapters, which do not provide the nonlinear signal conversion described above.

However, a software equivalent of a nonlinear amplifier can be created that transforms the digitized signal before passing it to the speech recognition module. Although such a software amplifier cannot reduce quantization noise, it can be used to emphasize the signal levels that carry the most speech information. For example, the amplitude of weak signals can be reduced, ridding the signal of noise.
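Such a software stage might look like the following Python sketch of a downward expander; the threshold and ratio values are invented for illustration, and a real system would pick them from the noise statistics.

    import numpy as np

    def soft_expander(samples: np.ndarray, threshold: float = 0.02,
                      ratio: float = 2.0) -> np.ndarray:
        """Attenuate samples below the threshold (downward expansion).

        Quiet material, which is mostly noise, is pushed further down,
        while samples above the threshold pass through unchanged.
        """
        out = samples.astype(float).copy()
        quiet = np.abs(out) < threshold
        # power curve chosen so the transfer characteristic is continuous
        # at the threshold point
        out[quiet] = (np.sign(out[quiet]) * threshold
                      * (np.abs(out[quiet]) / threshold) ** ratio)
        return out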

Compression is one of the most myth-ridden topics in sound production. They say Beethoven even scared the neighbors' children with it :(

Okay, in truth, applying compression is no more difficult than using distortion; the main thing is to understand how it works and to have good monitoring. Which is what we are now going to do, together.

What is audio compression

The first thing to understand before we begin is that compression is work with the dynamic range of a sound. And the dynamic range, in turn, is nothing more than the difference between the loudest and the quietest signal levels:

So compression is the squeezing of the dynamic range. Yes, simply dynamic range compression: in other words, lowering the volume of the loud parts of the signal and raising the volume of the quiet parts. No more than that.

You might reasonably ask what all the hype is about. Why does everyone talk about recipes for proper compressor tuning, yet nobody shares them? Why, despite the huge number of excellent plugins, do many studios still use expensive rare models of compressors? Why do some producers use compressors at extreme settings, while others do not use them at all? And who, in the end, is right?

Problems that compression solves

The answers to such questions lie in understanding the role of compression in working with sound. Compression allows you to:

  1. Emphasize the attack of a sound, making it more pronounced;
  2. "Seat" individual instrument parts into the mix, adding power and "weight" to them;
  3. Make groups of instruments, or the whole mix, more cohesive, a single monolith;
  4. Resolve conflicts between instruments using sidechain compression;
  5. Correct the flaws of a vocalist or musicians by evening out their dynamics;
  6. At certain settings, act as an artistic effect.

As you can see, this is no less significant a creative process than, say, inventing melodies or finding interesting timbres. And any of the tasks above can be solved using four main parameters.

Main parameters of the compressor

Despite the huge number of software and hardware models of compressors, all the "magic" of compression occurs when the main parameters are set correctly: Threshold, Ratio, Attack and Release. Let's consider them in more detail:

Threshold, dB

This parameter sets the level at which the compressor engages (i.e. begins to compress the audio signal). If we set the threshold to -12 dB, the compressor will act only on those parts of the dynamic range that exceed this value. If the entire signal is quieter than -12 dB, the compressor simply passes it through without affecting it in any way.

Ratio (compression ratio)

The ratio parameter determines how strongly a signal exceeding the threshold is compressed. A bit of math to complete the picture: suppose we set up a compressor with a threshold of -12 dB and a ratio of 2:1, and feed it a drum loop in which the kick peaks at -4 dB. What will the compressor do in this case?

In our case the kick level exceeds the threshold by 8 dB. According to the ratio, this excess is compressed to 4 dB (8 dB / 2). Together with the unprocessed part of the signal, this means that after the compressor the kick peaks at -8 dB (threshold of -12 dB + 4 dB of compressed signal).
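This arithmetic is nothing more than a static gain curve, which can be written out in a few lines of Python (a sketch; the names are ours):

    def compress_level(level_db: float, threshold_db: float = -12.0,
                       ratio: float = 2.0) -> float:
        """Static compressor curve: the excess over the threshold is divided by the ratio."""
        if level_db <= threshold_db:
            return level_db              # below threshold: untouched
        return threshold_db + (level_db - threshold_db) / ratio

    print(compress_level(-4.0))  # -8.0 dB, exactly as in the kick example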

Attack, ms

This is the time the compressor takes to react once the threshold is exceeded. That is, if the attack time is above 0 ms, the compressor starts compressing the over-threshold signal not instantly, but after the specified time has elapsed.

Release or recovery, ms

The opposite of attack: the value of this parameter specifies how long after the signal level has dropped back below the threshold the compressor keeps compressing before letting go.
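Putting the four parameters together, a naive feed-forward compressor can be sketched in Python as below; real compressors differ in the design of the level detector, but the roles of threshold, ratio, attack and release are exactly these.

    import numpy as np

    def compressor(x, fs, threshold_db=-12.0, ratio=2.0,
                   attack_ms=10.0, release_ms=100.0):
        """Naive feed-forward compressor with attack/release smoothing."""
        a_att = np.exp(-1.0 / (fs * attack_ms / 1000.0))
        a_rel = np.exp(-1.0 / (fs * release_ms / 1000.0))
        env_db = -120.0                          # smoothed level-detector state
        out = np.empty(len(x), dtype=float)
        for i, s in enumerate(x):
            level_db = 20 * np.log10(max(abs(s), 1e-6))
            # attack smoothing when the level rises, release when it falls
            coeff = a_att if level_db > env_db else a_rel
            env_db = coeff * env_db + (1 - coeff) * level_db
            over = env_db - threshold_db
            gain_db = -over * (1 - 1 / ratio) if over > 0 else 0.0
            out[i] = s * 10 ** (gain_db / 20)
        return out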

Before we move on, I strongly recommend taking a well-known sample, putting any compressor on its channel and experimenting with the above parameters for 5-10 minutes, to really cement the material.

All other parameters are optional. They can differ between compressor models, which is partly why producers use different models for specific purposes (for example, one compressor for vocals, another for the drum group, a third for the master channel). I will not dwell on these parameters in detail, but will only give general information about what they are:

  • Knee (Hard/Soft). This parameter determines how quickly the compression ratio is applied: abruptly (hard) or smoothly (soft). Note that in Soft Knee mode the compressor does not operate along a straight line but begins, smoothly (as far as that is appropriate when we are talking about milliseconds), to squeeze the sound even before the threshold value is reached. For processing groups of channels and the overall mix, a soft knee is used more often (since it works unobtrusively), while a hard knee is used to emphasize the attack and other features of individual instruments;
  • Response mode: Peak/RMS. Peak mode is justified when you need to limit bursts of amplitude severely, and also on signals of complex shape whose dynamics and legibility must be conveyed in full. RMS mode treats the sound very gently, allowing you to thicken it while preserving the attack;
  • Lookahead. This is the time by which the compressor knows in advance what to expect: a kind of preliminary analysis of the incoming signal;
  • Makeup, or Gain. A parameter that compensates for the loss of volume resulting from compression.

The first and most important piece of advice, which removes all further questions about compression: if you a) understand the principle of compression, b) know firmly how each parameter affects the sound, and c) have managed to try several different models in practice, you do not need any advice.

I am absolutely serious. If you have read this post carefully, experimented with your DAW's standard compressor and a couple of plug-ins, but still have not understood in which cases to set large attack values, which ratio to use and in which mode to process the source signal, then you will go on searching the Internet for ready-made recipes and applying them thoughtlessly wherever you can.

Recipes for fine-tuning a compressor are about like recipes for fine-tuning a reverb or a chorus: devoid of any meaning and having nothing to do with creativity. So I persistently repeat the only true recipe: arm yourself with this article, good monitor headphones and a plug-in for visual control of the waveform, and spend an evening in the company of a couple of compressors.

Take action!
