The Perceptual Organization of Streams of Sound
Abstract and Keywords
Chapter 3 first explores the principles by which we organize elements of an array into groupings. The Gestalt psychologists proposed a set of grouping principles that have profoundly influenced the study of hearing and vision ever since—these include “proximity,” “similarity,” “good continuation,” “common fate,” and “closure.” Passages of conventional tonal music illustrating these principles are described, along with several illusions and other surprising characteristics of music and speech, all presented as sound examples. They involve the segregation of pitch sequences into separate streams based on proximity in pitch or in time, and also on timbre or sound quality. Figure–ground relationships, analogous to those in vision, are also discussed. Much information arrives at our sense organs in fragmented form, and the perceptual system needs to infer continuities between the fragments, and fill in the gaps appropriately. It is shown that this occurs in both music and speech. We have evolved mechanisms to perform these tasks, but these mechanisms often fool us into “hearing” sounds that are not really there. Another approach to perceptual organization in music exploits the use of orchestral sound textures to create ambiguous images. This approach has been used to excellent effect in 20th-century music such as film scores; for example, it contributes to the mysterious ambience in Stanley Kubrick’s 2001: A Space Odyssey.
Music is organized sound.
IMAGINE THAT YOU are sitting at a sidewalk cafe. Traffic rumbles past you along the street. Pedestrians stroll by, engaged in lively conversation. Waiters clatter dishes as food is served, and music emanates softly from inside the building. At times one of these streams of sound engages your attention, at times a different one, but in a sense you are continually aware of all the sounds that reach you.
Or imagine that you are listening to an orchestral performance in a concert hall. The sounds produced by the different instruments are mixed together and distorted in various ways as they travel to your ears. Somehow you are able to disentangle the components of the sound mixture that reaches you, so that you hear the first violins playing one set of tones, the flutes another, and the clarinets another. You group together the sounds that you perceptually reconstructed so as to hear melodies, harmonies, timbres, and so on. What algorithms does the auditory system employ to accomplish these difficult tasks?
Our hearing mechanism is constantly identifying and grouping sounds into unified streams, so that we can focus attention on one of them while others are relegated to the background. Then we can change our focus so that the foreground becomes background, and vice versa. Composers exploit this effect in a number of ways. In much accompanied singing, the voice is clearly intended as figure, and the accompaniment as background. On the other hand, in contrapuntal music, such as many works by Johann Sebastian Bach, there are figure–ground relationships of a different type. Here two or more melodic lines are played in parallel, and the listener switches attention back and forth between these lines. In such music we have analogies to ambiguous figures in vision; these can be perceived in different ways depending on how the viewer focuses his attention. One example, created by the Danish artist Edgar Rubin, is shown in Figure 3.1.1 This picture can be interpreted either as a white vase against a black background, or as two black faces staring at each other across a white space. We can choose at will to “see” either the vase or the faces, but we cannot achieve both perceptual organizations at the same time.
What perceptual principles do we employ when we group elements of an array into unified wholes? The Gestalt psychologists, who flourished during the early part of the twentieth century, were concerned with how the brain organizes the information presented to the senses so that we perceive entire figures, rather than collections of unrelated parts. Notable among the Gestaltists were Kurt Koffka, Max Wertheimer, and Wolfgang Köhler.2 They proposed a set of grouping principles that continue to influence the study of both hearing and vision today (Figure 3.2).
One principle, which is termed proximity, states that we form connections between elements that are closer together in preference to those that are spaced further apart. In example (a) of Figure 3.2 we group the dots together on the basis of proximity, so we perceive them as grouped in pairs. A second principle, termed similarity, states that we form connections between elements that are similar to each other in some way. In example (b), we group together the filled circles and separate them from the unfilled ones.
A third principle, termed good continuation, asserts that we form connections between elements that follow each other smoothly in the same direction. In example (c) we perceive lines AB and CD, rather than AC and DB. Yet another principle, termed closure, states that we tend to perceive elements of an array as complete units. In example (d) we perceive a circle with two gaps, rather than two separate curved lines. A further principle, termed common fate, maintains that elements that move in the same direction are perceptually linked together. Imagine a flock of birds all flying in the same direction. If one of these birds turns and flies in the opposite direction, it will stand out perceptually from the others, which we continue to see as a group.
As I contend in my article “Musical Illusions,”3 and Albert Bregman explores in detail in his influential book Auditory Scene Analysis,4 we can assume that the perceptual system has evolved to form groupings in accordance with these principles because they enable us to interpret our environment most effectively. This in turn provides an evolutionary advantage—early in our prehistory it would have enabled us to locate sources of food and alert us to the existence of predators.
Consider, for example, the principle of proximity. In vision, elements that are close together in space are more likely to have arisen from the same object than those that are further apart. In hearing, sounds that are close in pitch, or in time, are more likely to have come from the same source than are sounds that are far apart. The principle of similarity is analogous. Regions of the visual field that are similar in color, brightness, or texture have probably emanated from the same object, and sounds that are similar in character (a series of thuds, or chirps, for example) are likely to have come from the same source. It is probable that a line that follows a smooth pattern has originated from a single object, and a sound that changes smoothly in pitch has probably come from a single source. The same argument holds for common fate. An object that moves across the visual field gives rise to perceptual elements that move coherently with each other, and the components of a complex sound that rise and fall in synchrony are likely to have arisen from the same source.5
While the visual system tends to carve the perceptual array into spatial regions that are defined according to various criteria such as brightness, color, and texture, the hearing system often forms groupings based on patterns of pitches in time. Such groupings can be illustrated by mapping pitch into one dimension of visual space and time into another. Indeed, musical scores have evolved by convention to be just such mappings. Tones that are higher in pitch are represented as higher on a musical staff, and progression in time is represented as movement from left to right. In the musical score shown in Figure 3.3, the dramatic visual contours reflect well the sweeping movements of pitch that are heard in this Mozart passage.
We have already seen an example of grouping by pitch proximity in the scale illusion, and in related stereo illusions. There, two series of tones arise in parallel from opposite regions of space, and each series is composed of tones that leap around in pitch. We reorganize the tones perceptually in accordance with pitch proximity, so that we hear one series of tones formed of the higher pitches as though coming from one region of space, and another series of tones formed of the lower pitches as though coming from the opposite region. So in these stereo illusions, organization by pitch proximity is so powerful as to cause half the tones to be perceived as coming from the incorrect location.
Other grouping effects appear strongly when sounds are presented in rapid succession. Musicians have known for centuries that when a series of tones is played at a rapid tempo, and these tones are drawn from two different pitch ranges, the series splits apart perceptually so that two melodic lines are heard in parallel. This effect underlies the technique of pseudopolyphony, or compound melodic line. When we listen to such a pattern, we don’t form perceptual relationships between tones that are adjacent in time; instead, we perceive two melodic lines in parallel, one corresponding to the higher tones and the other to the lower ones. Baroque composers such as Bach and Telemann employed this technique frequently. Figure 3.4 shows two examples. Parts (a) and (b) show the sequences in musical notation, and (aʹ) and (bʹ) show the same sequences plotted as pitch versus time. In passage (a) we perceive two melodic lines in parallel, one formed of the higher tones and the other of the lower ones. In passage (b) a single pitch occurs repeatedly in the lower range, and this provides the ground against which we perceive the melody in the higher range. Many examples of pseudopolyphony also occur in nineteenth- and twentieth-century guitar music, for example by Francisco Tárrega and by Agustín Barrios.
G. A. Miller and G. A. Heise at Harvard University created one of the first laboratory demonstrations showing that listeners organize rapidly repeating tones on the basis of pitch proximity. They had subjects listen to series of tones that consisted of alternating pitches played at a rate of ten per second.6 When the alternating tones were less than three semitones apart, the listeners heard the series as a single coherent string (a trill). When, however, the pitch separation was more than three semitones, the listeners didn’t form connections between tones that were adjacent in time, but instead perceived two repeating pitches that appeared unrelated to each other.
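A stimulus of this kind can be sketched in a few lines of Python. This is an illustration only, not the procedure Miller and Heise used: the 440 Hz starting tone, the function names, and the sharp three-semitone cutoff are assumptions made for the example, and in practice the boundary shifts with tempo and pitch register.

```python
def semitones_up(freq_hz, semitones):
    """Frequency lying `semitones` equal-tempered semitones above freq_hz."""
    return freq_hz * 2 ** (semitones / 12)


def alternating_tones(low_hz, separation, n_tones, rate_per_s=10.0):
    """Return (onset_seconds, frequency_hz) pairs for a low/high alternation."""
    high_hz = semitones_up(low_hz, separation)
    return [
        (i / rate_per_s, low_hz if i % 2 == 0 else high_hz)
        for i in range(n_tones)
    ]


def predicted_percept(separation, threshold=3.0):
    """Rule of thumb from the text: small separations cohere into a trill.

    The hard threshold is a simplifying assumption, not a measured value.
    """
    if separation < threshold:
        return "single stream (trill)"
    return "two streams"


if __name__ == "__main__":
    for sep in (1, 2, 5, 12):
        print(sep, "semitones:", predicted_percept(sep))
```

Calling `alternating_tones(440.0, 2, 20)` gives two seconds of a narrow alternation that this rule classes as a trill, while a separation of 12 semitones gives two tones an octave apart that it classes as separate streams.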
As a consequence of the perceptual splitting of pitch series into two streams, temporal relationships across tones in the different streams appear blurred. Albert Bregman and Jeffrey Campbell at McGill University created a repeating series of six pure tones, three high and three low ones, with those drawn from the high and low groups occurring in alternation. The rate of presentation was fast, and this caused the listeners to hear two distinct streams, one formed of the high tones and the other of the low ones. They could easily judge the order of the tones in the same stream, but were quite unable to judge their order across streams. Indeed, the listeners often reported that all the high tones preceded all the low ones, or vice versa, yet such orderings never occurred.7
Even when the rate of presentation of such tones is slowed down so that we can identify their order, there is still a gradual breakdown in the ability to judge temporal relationships between tones as the pitch distance between them increases. The physicist Leon Van Noorden at Eindhoven, the Netherlands, created a repeating stream of tones in the continuing pattern . . .—ABA—ABA—. . . where the dashes indicate silent intervals. This stream was played with the tones beginning far apart in pitch, and their pitches gradually converged and then diverged. When the pitch difference between successive tones was small, listeners perceived the pattern as forming a “galloping rhythm” (di-di-dum, di-di-dum). But when this pitch difference was large, the rhythm disappeared perceptually.8 This demonstration is presented in the “Galloping Rhythm” module.
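The timing of the ABA pattern can be made concrete with a short Python sketch. It is not Van Noorden's own stimulus specification: representing pitches as MIDI note numbers, the 0.1-second tone slot, and the function names are all assumptions chosen for illustration.

```python
def gallop_sequence(a_midi, b_midi, n_triplets, tone_s=0.1):
    """Onsets and pitches for the repeating pattern A B A followed by silence.

    Each cycle occupies four equal slots: three tones and one silent slot.
    Pitches are MIDI note numbers (an assumption made for this sketch).
    """
    events = []
    for k in range(n_triplets):
        start = k * 4 * tone_s
        for slot, pitch in enumerate((a_midi, b_midi, a_midi)):
            events.append((start + slot * tone_s, pitch))
    return events


def sweep_separations(max_sep):
    """Separations in semitones that converge to zero and then diverge."""
    return list(range(max_sep, -1, -1)) + list(range(1, max_sep + 1))


if __name__ == "__main__":
    a = 60  # fixed A tone (middle C in MIDI numbering)
    for sep in sweep_separations(3):
        print(sep, gallop_sequence(a, a + sep, 1))
```

When the B tone lies close to the A tone, the three events in each cycle are heard as one galloping unit; as `sep` grows, the A tones and the B tones fall into separate streams and the gallop is lost.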
Both Van Noorden and Bregman found that the larger the number of tones that were heard, the greater was the tendency to perceive them as forming two streams. Bregman maintained that the formation of separate streams with time reflects our cumulating evidence that two sources are producing the tones, rather than a single source.
The perceptual separation of tones that are disparate in pitch has a further intriguing consequence, which was demonstrated by W. Jay Dowling. He generated two well-known melodies that were played at a rapid tempo. The tones from the two melodies occurred in alternation, and listeners were asked to identify the melodies. When they were in the same pitch range, as in Figure 3.5(a), perceptual connections were formed between tones that followed each other in time. In consequence, successive tones in each melody were not perceptually linked together, with the result that the melodies could not be identified. Yet when one of the melodies was transposed to a different pitch range, as in Figure 3.5(b), the melodies were separated perceptually into distinct streams. Perceptual connections were then formed between successive tones in each stream, so that identification of the melodies became much easier.9 This effect is shown in the “Interleaved Melodies” module.
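The construction of such a stimulus can be sketched as follows. The note lists here are hypothetical placeholders rather than the melodies Dowling used, and the use of MIDI note numbers and the function names are assumptions made for the example.

```python
def interleave(melody_a, melody_b, transpose_b=0):
    """Alternate the notes of two equal-length melodies (MIDI note numbers)."""
    if len(melody_a) != len(melody_b):
        raise ValueError("melodies must have the same number of notes")
    combined = []
    for a, b in zip(melody_a, melody_b):
        combined.extend([a, b + transpose_b])
    return combined


def ranges_overlap(melody_a, melody_b, transpose_b=0):
    """True if the pitch ranges of the two melodies share any pitch."""
    low_b = min(melody_b) + transpose_b
    high_b = max(melody_b) + transpose_b
    return min(melody_a) <= high_b and low_b <= max(melody_a)


if __name__ == "__main__":
    tune_a = [60, 62, 64, 60]  # hypothetical melody fragment
    tune_b = [64, 62, 60, 62]  # hypothetical melody fragment
    print(interleave(tune_a, tune_b), ranges_overlap(tune_a, tune_b))
    print(interleave(tune_a, tune_b, 12), ranges_overlap(tune_a, tune_b, 12))
```

With no transposition the two ranges overlap, which corresponds to the condition in which the melodies are hard to identify; transposing the second melody up an octave removes the overlap, which corresponds to the condition in which they separate into distinct streams.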
So far, we have been exploring situations in which proximity in pitch competes successfully with proximity in time in determining how sounds are perceptually grouped together. Yet under other circumstances, timing relationships can instead be the deciding factor. For example, if you take a sequence of sounds, and insert pauses between the sounds intermittently, listeners perceptually group them into units that are defined by the pauses. This tendency can be so strong as to interfere with grouping along other lines. Gordon Bower at Stanford University has shown that when a meaningful string of letters (acronyms or words) is read out to listeners, their memory for the strings worsens considerably when pauses are inserted that are inconsistent with their meanings. For example, we have difficulty remembering the string of letters IC BMP HDC IAFM. On the other hand, when the same sequence of letters is presented as ICBM PHD CIA FM, we perceive the acronyms, and so can easily remember the letter strings.10
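The two presentations of the letter string differ only in where the pauses fall, as a small Python sketch makes plain. The function name and the representation of pauses as spaces are assumptions for the example; the letters and groupings are those given above.

```python
def insert_pauses(letters, group_sizes):
    """Split a letter string into groups, marking each pause with a space."""
    if sum(group_sizes) != len(letters):
        raise ValueError("group sizes must account for every letter")
    groups = []
    start = 0
    for size in group_sizes:
        groups.append(letters[start:start + size])
        start += size
    return " ".join(groups)


if __name__ == "__main__":
    letters = "ICBMPHDCIAFM"
    print(insert_pauses(letters, [2, 3, 3, 4]))  # pauses cut across the acronyms
    print(insert_pauses(letters, [4, 3, 3, 2]))  # pauses respect the acronyms
```

The letters and their order are identical in both outputs; only the group boundaries change, and it is these boundaries that determine whether the familiar acronyms are perceived.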
I later showed that the same principle holds for musical passages. I devised passages that consisted of subsequences that had a uniform pitch structure. Musically trained participants listened to these passages, and wrote down in musical notation what they heard. When pauses were inserted between the subsequences so as to emphasize the repeating pitch structure, the participants notated the passages easily. On the other hand, when the pauses conflicted with the repeating pitch structure, they had considerable difficulty notating the passages. Figure 3.6 illustrates an example, which you can also hear in the “Timing and Sequence Perception” module. It’s clear that when the pauses are in accordance with the repeating pitch structure (b), it is easy to apprehend the overall passage; inappropriate pauses (c), however, interfere with the ability to perceive the passage, despite its repeating pitch structure.11
We are amazingly sensitive to the quality of the sounds that we hear, and we readily group sounds into categories. This is an example of the principle of similarity. The English vocabulary contains an enormous number of words that describe different types of sound, with each word evoking a distinct image. Consider, for example, the abundance of words that describe brief sounds, such as click, clap, crack, pop, knock, thump, crash, splash, plop, clink, and bang. Then there are longer-lasting sounds such as crackle, rustle, jangle, rumble, hum, gurgle, rattle, boom, creak, whir, shuffle, and clatter. The richness of this vocabulary shows how much we rely on sound quality to understand our world.
It’s not surprising, then, that sound quality or timbre is a strong factor in determining how musical sounds are perceptually grouped together. Composers frequently exploit this feature to enable the listener to separate out parallel streams in musical passages. In the passage shown in Figure 3.7, which is taken from Beethoven’s Spring Sonata for violin and piano, the pitches produced by the two instruments overlap, yet we hear the phrases produced by the different instruments as quite distinct from each other. This can be heard in the “Beethoven’s Spring Sonata” module.12
The psychologist David Wessel generated a series of three tones that repeatedly ascended in pitch, but tones that were adjacent in time were composed of alternating timbres. An example is illustrated in Figure 3.8, and displayed in the “Timbre Illusion” module. When the difference in timbre between the tones was small, listeners heard this passage as ascending pitch lines. Yet when this difference was large, listeners grouped the tones on the basis of timbre, and so heard two interwoven descending lines instead. Speeding up the tempo increased the impression of grouping by timbre.13
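It may not be obvious why grouping by timbre turns an ascending pattern into descending ones, and a short Python sketch shows the arithmetic. The pitch labels 1, 2, 3 (lowest to highest), the timbre labels "A" and "B", and the function names are assumptions made for illustration.

```python
from itertools import cycle, islice


def wessel_sequence(pitches=(1, 2, 3), timbres=("A", "B"), n=12):
    """Pair a repeating ascending pitch pattern with alternating timbres."""
    return list(islice(zip(cycle(pitches), cycle(timbres)), n))


def stream_for(sequence, timbre):
    """Pitches heard in one stream when tones are grouped by timbre."""
    return [pitch for pitch, label in sequence if label == timbre]


if __name__ == "__main__":
    seq = wessel_sequence()
    print([pitch for pitch, _ in seq])  # grouped by time: 1 2 3 1 2 3 ...
    print(stream_for(seq, "A"))         # grouped by timbre: 1 3 2 1 3 2
    print(stream_for(seq, "B"))         # grouped by timbre: 2 1 3 2 1 3
```

Because a three-tone pattern is paired with a two-timbre alternation, each timbre picks out every second tone, and every second tone of an ascending triplet steps downward (3, 2, 1) before wrapping around. The two timbre streams are therefore both descending and offset from each other.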
Just as with perceptual separation on the basis of pitch, separation by sound quality can have striking effects on our ability to identify the ordering of sounds. Richard Warren and his colleagues at the University of Wisconsin constructed a repeating series of four unrelated sounds—a high tone, a hiss, a low tone, and a buzz—which they presented at a rate of five sounds per second. Listeners were completely unable to name the order in which the sounds occurred—the duration of each sound had to be slowed down considerably before they were able to name their ordering.14
Much information arrives at our sense organs in fragmented form, and the perceptual system needs to infer continuities between the fragments, and to fill in the gaps appropriately. For example, we generally see branches of trees when they are partly hidden by foliage, and we infer which of the visible segments were derived from the same branch. When we make such inferences, we are employing the principles of good continuation and closure, since the gaps between segments of a branch can be filled in perceptually to produce a smooth contour.
The Kanizsa triangle shown in Figure 3.9 is an example of an illusory contour that is produced when the visual system fills in gaps so that an object is perceived as an integrated whole. We interpret this figure as a white triangle that occludes other objects, in accordance with the principles of good continuation and closure.15
As a related effect, our hearing mechanism is continually attempting to restore lost information. Imagine that you are talking with a friend on a busy street, and the passing traffic intermittently drowns out the sound of his voice. To follow what your friend is saying, you need to infer continuity between the fragments of his speech that you can hear, and to fill in the fragments that you missed hearing. We have evolved mechanisms to perform these tasks, and we employ them so readily that it’s easy to produce illusions based on them. In other words, we can easily be fooled into “hearing” sounds that are not really there.
One way to create continuity illusions is to have a softer sound intermittently replaced by a louder one; this creates the impression that the softer sound is continually present. George Miller and Joseph Licklider presented a tone in alternation with a louder noise, with each sound lasting a twentieth of a second. Listeners reported that the tone appeared to continue right through the noise.16 The authors obtained similar effects using lists of words instead of tones, and they described such illusions as like viewing a landscape through a picket fence, where the pickets interrupt the view at intervals, and the landscape appears continuous behind the pickets.
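A stimulus with this structure can be sketched in Python using only the standard library. Only the segment duration of a twentieth of a second comes from the description above; the sample rate, tone frequency, amplitudes, use of uniform random noise, and the function name are assumptions chosen for the example, and this is not a reconstruction of the original apparatus.

```python
import math
import random


def tone_noise_alternation(n_cycles, seg_s=0.05, sample_rate=8000,
                           tone_hz=1000.0, tone_amp=0.2, noise_amp=0.8,
                           seed=0):
    """Return samples in which a soft tone alternates with louder noise.

    Each cycle is one tone segment followed by one noise segment, both
    lasting seg_s seconds. Sample values lie between -1 and 1.
    """
    rng = random.Random(seed)
    seg_len = int(round(seg_s * sample_rate))
    samples = []
    for _cycle in range(n_cycles):
        # Soft tone segment.
        for i in range(seg_len):
            phase = 2 * math.pi * tone_hz * i / sample_rate
            samples.append(tone_amp * math.sin(phase))
        # Louder noise segment that replaces the tone.
        for _j in range(seg_len):
            samples.append(noise_amp * rng.uniform(-1.0, 1.0))
    return samples


if __name__ == "__main__":
    signal = tone_noise_alternation(10)
    print(len(signal), "samples, one second of sound at 8000 samples per second")
```

The tone is physically absent during every noise segment; the illusion lies in the listener hearing it continue. Setting `noise_amp` to zero produces the comparison condition with silent gaps, in which the tone is heard as interrupted.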
Continuity effects can also be produced by signals that vary in time. Gary Dannenbring at McGill University created a gliding tone that alternately rose and fell in pitch. He then interrupted the glide periodically. When only silence intervened between portions of the glide, as in Figure 3.10(a), listeners heard a sequence of isolated V-shaped patterns. Yet when loud noise bursts were inserted during each gap, as in Figure 3.10(b), they instead heard a single, continuously rising and falling sound that appeared to glide back and forth right through the noise.17
A particularly convincing version of the continuity illusion was created by Yoshitaka Nakajima in Japan, and can be heard in the “Continuity Illusion” module. First there is a long tone with a gap in the middle, and the gap is clearly heard. Then a loud complex tone is presented alone. Finally, the long tone with the gap is again presented, this time with the complex tone inserted in the gap, and the long tone now appears continuous.
Speech sounds are also subject to illusory restoration effects, and these are particularly strong when a meaningful sentence establishes a context. Richard Warren and his colleagues recorded the sentence “The state governors met with their respective legislators in the capital city.” Then they deleted the first “s” in the word “legislators” and replaced it with a louder sound, such as a cough, and found that the sentence still appeared intact to listeners. Even after hearing the altered sentence several times, the listeners didn’t perceive any part of it as missing, and they believed that the cough had been added to the recording. Even more remarkably, when they were told that the cough had replaced some portion of the speech, they couldn’t locate where in the speech it had occurred. Yet when the missing speech sound left a silent gap instead, the listeners had no difficulty in determining the position of the gap.18
Similar effects occur in music. Takayuki Sasaki in Japan recorded familiar piano pieces, and replaced some of the tones with bursts of noise. Listeners heard the altered recordings as though the noise bursts had simply been added to them, and they generally couldn’t identify the positions of the noise bursts.19 Perceptual restorations of this type must occur frequently when listening to music in concert halls, where coughs and other loud noises produced by the audience would otherwise cause the music to appear fragmented.
The continuity illusion is used to good effect in Electronic Dance Music (EDM). The beat from the kick drum is sometimes accompanied by a subtle reduction in amplitude of other ongoing sounds. As a result of the continuity effect, the kick can be heard more clearly, while the other sounds still appear continuous.
What characteristics of the substituted sound are conducive to producing illusory continuity? When a sound signal is interrupted by periods of silence, it’s unlikely that an extraneous factor has caused these interruptions, so we can reasonably assume that the interruption is occurring in the signal itself. But loud noise bursts that replace the silent gaps can plausibly be interpreted as extraneous sounds that are intermittently drowning out the signal. For this reason, illusory continuity effects occur best when the interposed sound is loud, when it is strictly juxtaposed with the signal, and when the transition between it and the signal is sufficiently abrupt as to avoid the conclusion that the signal itself is intermittent.
However, these conditions are not necessary for continuity effects to occur. For example, guitar tones are characterized by rapid increases and decreases in loudness. In music played on such instruments, when the same tone is rapidly repeated many times, and it’s intermittently omitted and replaced by a different tone, listeners generate the missing tone perceptually. Many such examples occur in guitar music of the nineteenth and twentieth centuries, such as Tárrega’s Recuerdos de la Alhambra (see Figure 3.11), and Barrios’s Una Limosna por el Amor de Dios. Here the strong expectations set up by the rapidly repeating tones (called ‘tremolo’) cause the listener to “hear” the missing tones, even though they are not being played. The “Recuerdos de la Alhambra” module presents such an example.
Another approach to the perception of musical streams exploits the use of orchestral sound textures to create ambiguous images. In the late nineteenth and early twentieth centuries, composers such as Debussy, Strauss, and Berlioz used this approach to remarkable effect. Later composers such as Varèse, Penderecki, and Ligeti made sound textures the central elements of their compositions. Rather than following the principles of conventional tonal music, they focused on sound masses, clouds, and other textural effects, with extraordinary results. Ligeti’s Atmosphères is a fine example; it was made particularly famous by the mysterious ambience it created in Stanley Kubrick’s movie 2001: A Space Odyssey. As Philip Ball wrote:
The main effect of this harmonic texture is to weave all the voices together into one block of shifting sound, a rumbling, reverberant, mesmerizing mass. Faced with such acoustic complexity, the best the mind can do is to lump it all together, creating a single “object” that, in Bregman’s well-chosen words, is “perceptually precarious”—and therefore interesting.20
Such sound objects can appear to be in a perpetual state of transformation. In this way they are like visual arrays that seem to be constantly changing. Faced with ambiguity, the perceptual system is constantly forming organizations and then replacing them with new ones. Most of us have gazed into the sky and seen changing images in the clouds—strange edifices, faces, figures, hills, and valleys. These perceptual impressions have fascinated poets and artists, who sometimes deliberately incorporated images of objects in paintings of clouds or landscapes. As described by Shakespeare in Antony and Cleopatra:
- Sometime we see a cloud that’s dragonish,
- A vapour sometime like a bear or lion,
- A tower’d citadel, a pendant rock,
- A forked mountain, or blue promontory
- With trees upon’t, that nod unto the world,
- And mock our eyes with air.21
We have explored a number of organizational principles that are fundamental to the hearing process, and have focused on illusions that result from these principles. We next explore a different class of musical illusions. These result from sequences of tones that are ambiguous in principle. The tones give rise to pitch circularity effects, and also to an intriguing illusion, the tritone paradox, that is experienced quite differently by different listeners. Perception of the tritone paradox is strongly influenced by the pitch ranges of speech that are heard in childhood, and so provides a link between the brain mechanisms underlying speech and music.
(1.) The Rubin vase illusion was created by Edgar Rubin around 1915. Earlier versions appear in eighteenth-century French prints, which are composed of portraits in which the outlines of faces facing each other appear on either side of a vase.
(15.) The Kanizsa triangle was first published by Gaetano Kanizsa, an Italian psychologist, in 1955.
(21.) Shakespeare, Antony and Cleopatra, Act 4, Scene 14.