11/19/2023
Hal 9000 computer sounds

Voice interaction is the primordial human interaction. From early grunts to rhythmic iambic pentameter to lyrical musical renditions of the spoken word, its very antiquity and various evolutions are rich with complexity. In 1968, the seminal movie 2001: A Space Odyssey envisioned a fully aware computer system capable of nuanced conversation as well as self-awareness. Today, consumers are exposed to varying degrees of voice technology, and the mission of creating a fully interactive, computer-based voice experience remains. The process of applying compute resources and algorithms to capture, recreate, and ultimately mimic a seemingly intuitive process exposes just how complex speaking and listening comprehension can be.

At a high level, there are three parts to Voice Technology (VT): acoustics, language, and context. The acoustics part of VT is the piece that takes in sound as an input to the process and outputs phonemes, the fundamental linguistic units. The language aspect of VT deals with the higher-level abstractions of chaining phonemes into words, and words into phrases and/or sentences. Context deals with the myriad options for presenting the literal meaning of phrases and/or sentences and, finally, their contextual meaning. That's a lot to parse, so let's break it up with examples to better understand the complexity.

"This mission is too important for me to allow you to jeopardize it." HAL 9000, 2001: A Space Odyssey.

The human voice is the sound produced by the vocal tract. Like any sound, it is a combination of simple physics: a mix of Pitch, Amplitude (loudness), and Rate (speed).

As the first step in VT, the sound must get into the computer. The analog waves of sound are digitized, and herein lies the first complexity.

A classic example for exposing the imprecise nature of analog-to-digital conversion is the thirds problem. Write one third as a decimal, 0.333333…, and add three of them. Is the result 0.999999… or 1.0? One, or almost, but not exactly, one?

Coming back to VT, the process of analog-to-digital conversion requires sampling (measuring and recording) the sound waves in small time-slices. The shorter the sampling interval (1 sec, 100 msec, 10 msec, …), the more precise the digital representation. These time-slices, now available as discrete numerical values, are then pattern-matched to the appropriate phonemes. The vast computational resources available to voice technology processing today make short work of pattern matching. However, not everyone speaks clearly, concisely, and with adequate pauses to ensure clean phoneme matching.

To get a sense of the complexities, let's examine the word 'cat.' Focus on the first phoneme, the 'KA' sound. In the analog-to-digital conversion, if the entire 'KA' phoneme is captured in a single time-slice, then pattern-matching it to 'KA' is possible. Alternatively, if the phoneme 'KA' is split across two time-slices, with 'iK' in one time-slice and 'Aa' in the second, matching the 'KA' phoneme becomes impossible.

It's not what you say but how you say it:

The diversity of human voices and speech patterns, while amazing to cherish, is a larger technical problem in itself. Adapting algorithms to account for this variability is challenging. Using the same example as above, the 'KA' phoneme is uttered at a lower pitch by men and a higher pitch by women, and a fast-talking northerner and a speaker with an extended southern drawl pace it differently. Each variation results in a different digital pattern.

Imagine you want a computer to match a square, a simple shape with four sides.
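To make the digitization points concrete, here is a minimal Python sketch of two ideas from the text: the thirds problem (a value stored with a fixed number of digits no longer adds up exactly) and a phoneme falling into one time-slice or being split across two. The function names, the 100 msec slice width, and the 'KA' timings are all assumptions chosen for illustration; they are not measurements, and this is not how any particular speech recognizer is implemented.

```python
from decimal import Decimal, localcontext


def thirds_sum(digits: int) -> Decimal:
    """Store 1/3 with only `digits` significant digits, then add three of them."""
    with localcontext() as ctx:
        ctx.prec = digits                      # limited precision, like a digitizer
        one_third = Decimal(1) / Decimal(3)    # rounded, e.g. 0.333333 for 6 digits
        return one_third * 3                   # 0.999999, not 1


def slices_touched(start_ms: int, end_ms: int, slice_ms: int) -> list[int]:
    """Indices of the fixed-width time-slices covered by a sound that
    lasts from start_ms (inclusive) to end_ms (exclusive)."""
    first = start_ms // slice_ms
    last = (end_ms - 1) // slice_ms
    return list(range(first, last + 1))


if __name__ == "__main__":
    print(thirds_sum(6))                  # almost, but not exactly, one

    # Hypothetical timings: a 50 msec 'KA' sound, 100 msec time-slices.
    print(slices_touched(40, 90, 100))    # whole phoneme inside one slice
    print(slices_touched(80, 130, 100))   # same phoneme split across two slices
```

In the second call the whole 'KA' sits inside slice 0, so a pattern matcher sees it intact. In the third call the same sound starts 40 msec later and straddles slices 0 and 1, which is the 'iK' / 'Aa' split described above.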