The objective of the Pronunciation Task Force is to develop normative specifications and best-practices guidance, collaborating with other W3C groups as appropriate, to provide for proper pronunciation in HTML content when using text-to-speech (TTS) synthesis. This document defines a standard mechanism to allow content authors to include spoken presentation guidance in HTML content.
Accurate, consistent pronunciation and presentation of content spoken by text-to-speech (TTS) synthesis are essential requirements in education, communication, entertainment, and other domains. From helping to teach spelling and pronunciation in different languages, to reading learning materials or news stories, TTS has become a vital technology for providing access to digital content on the web, through mobile devices, and now via voice-based assistants. Organizations such as educational publishers and assessment vendors are looking for a standards-based solution that enables authoring of spoken presentation guidance in HTML, which can then be consumed by assistive technologies (AT) and other applications that utilize TTS for rendering of content. Historically, efforts at standardization (e.g., SSML or CSS Speech) have not led to broad adoption of any standard by user agents, authors, or AT; what has arisen instead is a variety of non-interoperable approaches that meet specific needs for some applications. This explainer document presents the case for improving spoken presentation on the Web and shows how a standards-based approach can address the requirements.
This is a proposal for a mechanism to allow content authors to include spoken presentation guidance in HTML content. Such guidance can be used by AT (including screen readers and read-aloud tools) and voice assistants to control TTS synthesis. A key requirement is to ensure that the spoken presentation matches the author's intent and user expectations.
The challenge is integrating pronunciation content into HTML so that it is easy to author, does not "break" content, and is straightforward for consumption by AT, voice assistants, and other tools that produce spoken presentation of content.
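As an illustrative sketch only, one attribute-based approach discussed by the task force would carry SSML properties in a `data-*` attribute; the `data-ssml` attribute name and its JSON payload below are a proposal under discussion, not a finalized standard:

```html
<!-- Illustrative sketch: the data-ssml attribute and its JSON payload
     reflect one proposed approach and are not a finalized standard.
     The markup degrades gracefully - tools that do not understand the
     attribute simply render the text as-is. -->
<p>
  The word
  <span data-ssml='{"phoneme": {"ph": "pɪˈkɑːn", "alphabet": "ipa"}}'>pecan</span>
  is often mispronounced by TTS engines.
</p>
```

Because the guidance lives in an attribute rather than in-line foreign markup, the visible text and the document structure remain intact for consumers that ignore it.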
Several classes of AT users depend upon spoken rendering of web content by TTS synthesis. In contexts such as education, there are specific expectations for accuracy of spoken presentation in terms of pronunciation, emphasis, prosody, pausing, etc.
Correct pronunciation is also important in the context of language learning, where incorrect pronunciation can confuse learners.
In practice, the ecosystem of devices used in classrooms is broad, and each vendor generally provides its own TTS engines for its platforms. Ensuring consistent spoken presentation across devices is a very real challenge. For many educational assessment vendors, the problem necessitates non-interoperable hacks to tune pronunciation and other presentation features, such as pausing, which can in turn introduce new problems through inconsistent representation of text across speech and braille.
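The pausing hack mentioned above typically means injecting punctuation (e.g., extra commas or periods) that the TTS engine will treat as a pause; that punctuation then also surfaces in braille output. A declarative mechanism could express the pause without altering the text. The `data-ssml` attribute below is hypothetical, following one proposed (non-final) approach:

```html
<!-- Illustrative only: a declarative pause, instead of injected
     punctuation that would also appear in braille output.
     The data-ssml attribute is a proposed, non-final mechanism. -->
<p>
  Select the correct answer.
  <span data-ssml='{"break": {"time": "750ms"}}'></span>
  A: 4 &nbsp; B: 6 &nbsp; C: 8
</p>
```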
It could be argued that continual advances in machine learning will improve the quality of synthesized speech, reducing the need for this proposal. Waiting for a robust solution that will likely still not fully address our needs is risky, especially when an authorable, declarative approach may be within reach (and wouldn't preclude or conflict with continual improvement in TTS technology).
The current situation:
With the growing consumer adoption of voice assistants, user expectations for high-quality spoken presentation are rising. Google and Amazon both encourage application developers to utilize SSML to enhance the user experience on their platforms, yet Web content authors do not have the same opportunity to enhance the spoken presentation of their content.
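On those platforms, developers use standard SSML 1.1 elements directly. For example, spelling out an identifier character by character and inserting a pause:

```xml
<!-- Standard SSML 1.1 elements of the kind voice-assistant
     platforms accept from application developers. -->
<speak>
  Your confirmation code is
  <say-as interpret-as="characters">A1B2</say-as>.
  <break time="500ms"/>
  Please write it down.
</speak>
```

Web content authors currently have no comparable, interoperable way to attach such guidance to HTML.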
A solution to this need would also have broader benefit, allowing authors to create web content that delivers a better user experience when the content is presented by voice assistants.
<span> with ARIA attributes)