Research and Development

VOCALOID™

VOCALOID is a singing synthesizer application developed by Yamaha and released in 2003. Since that time, many companies have introduced products that make use of this application. VOCALOID has gained considerable popularity, and both amateurs and professionals have successfully used it to produce attention-grabbing songs and videos.

Development Background

Researchers have been trying to get computers to sing for a long time. Most people have seen the famous sequence in the 1968 classic film 2001: A Space Odyssey where HAL, the computer, sings Daisy Bell (A Bicycle Built for Two). Perhaps fewer people are aware that the sequence was inspired by an actual 1961 synthesized performance of the same song on an IBM mainframe, implemented by a team that included John Kelly and Max Mathews, who is considered the father of computer music. In the five decades since then, many researchers have continued working on the technology. But in most cases the resulting voices have sounded mechanical, and the performances have lacked feeling. Technologists have dreamed of the day when a computer could finally sing like a real human singer.

While Yamaha has been working on synthesizer technologies for many years, our efforts to develop a singing synthesizer began only around the turn of the millennium. Rather than building an experimental system that would run on a mainframe and take days to generate a performance, we focused instead on creating a product that would allow even an unsophisticated user to generate a singing voice at low cost and play it back in real time. Musicians at that time were already able to create convincing instrument simulations using state-of-the-art synthesizer and sampling technologies, but high-quality simulation of that ultimate instrument, the human voice, remained elusive. Our goal from the start was to create commercial software that would implement this difficult technology in a way that would make it accessible to everyone.

Why Use Synthesized Singing?

Until recently, a musical performance generally required one musician for each instrument. To produce a big orchestra sound, you needed a full complement of musicians and instruments. But financial and spatial constraints often made such a lineup impractical. Then, with the emergence of electronic instrumentation, it became possible to get a big sound using fewer musicians and instruments: a single synthesizer could now be used to simulate numerous acoustic instruments, and each musician could take control of multiple electronic instruments. But humans were still required for all of the singing parts. Producers could get into trouble if they were short several singers for the chorus, or if a singer became unavailable for a performance. At Yamaha, we saw that we could help solve this problem by getting computers to sing. Synthetic singers come with additional benefits as well: they will sing the same song as many times as desired, always correctly, and always without complaint. Our goal was to design high-quality virtual singers that would offer these advantages.

The Arrival of VOCALOID

In 2003, we launched VOCALOID, the world's first serious commercial singing synthesizer—driving the technology out of the research world and into the real one. VOCALOID quickly earned plaudits for its sound clarity (the words were easy to hear), its naturalness (its close approximation to the human voice), and its superlative operability and ease of use. The US magazine Electronic Musician named VOCALOID the winner of its 2005 Editors Choice award for Most Innovative Product.

Features of VOCALOID Technology

Singer Library

To make voice synthesis a reality, we begin by recording a live singer in a studio. We don't record a typical song performance, however. Instead, we record what is needed to generate the full range of elements necessary to build a convincing virtual singer. In particular, we record many combinations of vowel and consonant sounds, and many combinations of different pronunciations (variations in nasal quality, etc.) and song lyrics.

The recorded data is then broken into sound fragments. These fragments are adjusted, edited, and refined into elements suitable for concatenation into a smooth sound. The elements are then stored in a database known as a singer library or singer database. Since each language uses its own distinct set of phonemes, we use different recording scripts and databases for each language we support: Japanese, English, and others that may follow.
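As a rough illustration only, and not a description of the actual VOCALOID data format, a singer library can be pictured as a per-language lookup table from short phoneme sequences to recorded sound fragments. The class name, method names, and phoneme symbols below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SingerLibrary:
    """Hypothetical in-memory stand-in for a singer library.

    Each entry maps a short phoneme sequence (for example a
    consonant-to-vowel transition) to a list of audio samples.
    """
    name: str
    language: str
    fragments: dict = field(default_factory=dict)

    def add_fragment(self, phonemes, samples):
        # Store a copy so later edits to the caller's list do not
        # change the library contents.
        self.fragments[tuple(phonemes)] = list(samples)

    def lookup(self, phonemes):
        key = tuple(phonemes)
        if key not in self.fragments:
            raise KeyError(f"no fragment recorded for {key!r}")
        return self.fragments[key]


# Illustrative usage with made-up sample values.
library = SingerLibrary(name="Demo Singer", language="en")
library.add_fragment(["d", "eI"], [0.0, 0.1, 0.2])
fragment = library.lookup(["d", "eI"])
```

Keeping one library per language mirrors the point above that each language needs its own recording script and database.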

VOCALOID products are designed and marketed by outside companies under license from Yamaha. The singer library constitutes the licensing company's proprietary part of the product. The licensing company selects the virtual singer (library) and then creates the vocal parts for that singer.

Score Editor

The user enters lyrics and notes in the score editor, then edits and adjusts them as necessary to get the desired nuances. Input and editing are easy and intuitive, thanks to an edit screen similar to a typical MIDI sequencer's piano-roll display. The user enters the lyrics as straight text; the software then automatically generates the corresponding phoneme sequence, activates the synthesis engine, and plays out the results. The user does not need to worry about the relationships between words and phonemes, as the software takes care of the conversion of text into phonemes.
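The text-to-phoneme step can be sketched as a dictionary lookup. This is a minimal illustration under stated assumptions: the pronunciation table, the phoneme symbols, and the function name are invented for the example and do not reflect the symbol set or conversion rules that VOCALOID actually uses.

```python
# Hypothetical pronunciation dictionary; the symbols are illustrative only.
PRONUNCIATIONS = {
    "daisy": ["d", "eI", "z", "i:"],
    "bell": ["b", "e", "l"],
}


def lyrics_to_phonemes(lyrics, dictionary=PRONUNCIATIONS):
    """Convert plain-text lyrics into (word, phoneme list) pairs."""
    result = []
    for raw_word in lyrics.lower().split():
        word = raw_word.strip(".,!?;:")
        if not word:
            continue  # the token was punctuation only
        if word not in dictionary:
            raise KeyError(f"no pronunciation known for {word!r}")
        result.append((word, dictionary[word]))
    return result


phonemes = lyrics_to_phonemes("Daisy, Bell!")
```

A real converter would also need rules for words missing from the dictionary; this sketch simply raises an error in that case.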

VOCALOID is particularly strong in the way it allows users to make subtle changes in voice properties and to program detailed nuances. By utilizing these capabilities, the user can create vocals that have great expressive power and deep feeling. The data generated at the score editor is sent to the synthesis engine in the form of MIDI messages.

Non-vocal parts are recorded or else entered as data at a digital audio workstation (DAW). VOCALOID can work together with music creation software, and can also accept melodies input from a MIDI keyboard.

Synthesis Engine

The synthesis engine concatenates the voice elements to generate the singing voice. The engine receives MIDI messages from the score editor; extracts the score data, lyrics, expression data, and other necessary information from these messages; retrieves the necessary sound elements from the singer library; and concatenates these elements to create the singing.

The engine automatically makes relevant adjustments during the concatenating process. For example, the engine will adjust the temporal position of element playback in cases where the original timing of consonant and vowel production would generate displeasing sounds. Indeed, VOCALOID's natural sound is a direct result of the techniques it uses to smooth the concatenation of phonemes and reduce noises at the phoneme boundaries.
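The two ideas in this paragraph, smoothing the joins and shifting playback timing, can be sketched as follows. This is an assumption-laden illustration rather than the engine's actual method: a linear crossfade stands in for whatever smoothing VOCALOID applies, and the timing rule (start early by the consonant length so that the vowel lands on the note onset) is one plausible reading of the adjustment described above. Both function names are hypothetical.

```python
def crossfade_concat(first, second, overlap):
    """Join two sample lists, blending linearly over `overlap` samples."""
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both fragments")
    if overlap == 0:
        return list(first) + list(second)
    head = list(first[:-overlap])
    tail = list(second[overlap:])
    blended = []
    for i in range(overlap):
        # Weight rises from near 0 to near 1 across the overlap region.
        weight = (i + 1) / (overlap + 1)
        a = first[len(first) - overlap + i]
        b = second[i]
        blended.append((1.0 - weight) * a + weight * b)
    return head + blended + tail


def playback_start(note_onset, consonant_length):
    """Start a fragment early so its vowel begins at the note onset.

    Times are in seconds; the result is clamped so that playback
    never starts before time zero.
    """
    return max(0.0, note_onset - consonant_length)


joined = crossfade_concat([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], 2)
start = playback_start(1.0, 0.08)
```

The crossfade avoids an abrupt jump at the fragment boundary, which is one simple way to reduce the boundary noise mentioned above.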

Further Development Leads to VOCALOID Version 2

Following the 2003 market release of the first version of the software, Yamaha's development team turned their attention to implementing technical improvements in two areas: they wanted to get better sound quality from the already well-received synthesis engine, and they wanted to further improve the usability of the score editor screen. They were successful in both efforts. They achieved better quality by improving the engine's concatenating technologies, resulting in more realistic and smoother pronunciation and singing. And they improved the editing screen by attending carefully to user feedback and working almost two years to design a simpler interface that even beginners could understand. VOCALOID 2, the fruit of these efforts, was announced in 2007.

The Hatsune Miku application that was built on VOCALOID 2 became a major hit in Japan, leading to an Internet-driven boom in the creation and dissemination of Hatsune Miku products. It was very gratifying to see this excellent response to our uncompromising development efforts.

Practical Technologies

NetVOCALOID

NetVOCALOID is a service that runs the VOCALOID synthesis engine on the server side, making singing synthesis available to network-connected devices. Users who do not own a Windows edition of VOCALOID can easily access VOCALOID features through this service.

NetVocaListener

NetVocaListener is a service that makes VocaListener functionality available over the network. The service is intended to be used in combination with VOCALOID or NetVOCALOID.

VocaListener is a technology that allows the user to supply a recorded voice as the model for a VOCALOID-synthesized singer. The user records the voice into a sound file; VocaListener then accesses the file and automatically estimates values for the VOCALOID parameters used to create the singer. Use of VocaListener is faster and requires less skill than use of the dedicated VOCALOID editor, making it much easier to achieve high-quality synthesis.
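To give a feel for what estimating a parameter from a recording can involve, the sketch below estimates the pitch of a short audio frame by autocorrelation and converts it to a MIDI note number. This is a generic textbook technique chosen for illustration; the source does not say which parameters VocaListener estimates or how, so the choice of pitch, the function names, and the default frequency range are all assumptions.

```python
import math


def estimate_pitch_hz(samples, sample_rate, min_hz=80.0, max_hz=1000.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    Returns None when no lag in the search range correlates positively.
    """
    min_lag = max(1, int(sample_rate / max_hz))
    max_lag = min(len(samples) - 1, int(sample_rate / min_hz))
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(
            samples[i] * samples[i + lag]
            for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None:
        return None
    return sample_rate / best_lag


def hz_to_midi_note(freq_hz):
    """Convert a frequency to the nearest MIDI note number (A4 = 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


# Illustrative usage on a synthetic 200 Hz sine tone.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
pitch = estimate_pitch_hz(tone, rate)
note = hz_to_midi_note(pitch)
```

A practical system would analyze many overlapping frames and would also have to estimate dynamics and timing, which this sketch leaves out.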

VOCALOID-flex

VOCALOID-flex is an improved version of VOCALOID that includes the capability of synthesizing natural-sounding speech. Good speech synthesis is even harder to achieve than good singing synthesis, as it requires the ability to make very subtle changes in the sound. For this reason, VOCALOID-flex is designed to deliver robust editing of basic sound elements (phoneme sound structure, phoneme length, etc.) and intonations (pitches, stresses, and more). These new capabilities are not available in previous VOCALOID versions. The software can deliver utterances that closely approximate human speech, and allows users to add accents and intonations that can simulate a wide range of dialectal speech patterns.
