Fwd: [Accessibility] TTS API document + introduction
Janina Sajka
janina at freestandards.org
Mon Mar 6 07:34:27 PST 2006
Thanks Olaf--and thanks also to Hynek.
Just one item ... We have not scheduled a teleconference for this
Wednesday, 8 March. The time slot is certainly available, but I cannot
participate, as I will be flying back to the U.S. at that time.
Olaf Jan Schmidt writes:
> Hi!
>
> For those who are not subscribed to accessibility at freedesktop.org I am
> forwarding the latest draft for the joint TTS API that we need for reworking
> kttsd and SpeechDispatcher.
>
> Hynek has written an introduction that summarises our approach. I hope it
> helps our discussion on Wednesday.
>
> Please cc the freedesktop.org list in your comments, because I want to make
> sure that there is at least one place where all the email discussion goes.
>
> Olaf
>
> --
> Olaf Jan Schmidt, KDE Accessibility co-maintainer, open standards
> accessibility networker, Protestant theology student and webmaster of
> http://accessibility.kde.org/ and http://www.amen-online.de/
> From: Hynek Hanke <hanke at brailcom.org>
> To: "Accessibility, Freedesktop" <accessibility at freedesktop.org>
>
>
> Hello,
>
> here is the latest version of the TTS API document with a new
> introduction section that tries to summarize the previous private and
> public discussions on this topic. Comments are welcome.
>
> With regards,
> Hynek Hanke
>
> Changes
> =======
>
> * Introduction was written (clarification of intent, scope)
>
> * Clarification of the meaning of MUST HAVE, SHOULD HAVE
>
> * Point (4.11) was removed as not directly important for accessibility
> (after discussions with Willie Walker who requested the point)
>
> * Point (4.13) was removed because its purpose is not clear.
> Even if this functionality is needed, the 's' SSML element is
> not a good way to do it.
>
> * Reformulation of (1.4), added 'temporarily' to (3.2), 'software
> synthesizers' in (4.4), terminology in (4.13),
> clarification in (B.1.4/2) and (B.1.4/3)
>
>
> Common TTS Driver Interface
> ============================
> Document version: 2006-03-06
>
> The purpose of this document is to define a common low-level interface
> for accessing the various speech synthesizers on Free Software and Open
> Source platforms. It is designed to be used both by applications that do
> not need advanced functionality such as message management and by
> applications providing high-level interfaces (such as Speech
> Dispatcher, Gnome Speech, KTTSD, etc.). The purpose of this document is
> not to define and force an API on the speech synthesizers themselves.
> The synthesizers might use different interfaces, which will be handled
> by their drivers.
>
> This interface will be implemented by a simple layer integrating
> available speech synthesis drivers and in some cases emulating some of
> the functionality missing in the synthesizers themselves.
>
> Advanced capabilities not directly related to speech, like message
> management, priorities, synchronization etc. are left out of scope for
> this low-level interface. They will be dealt with by higher-level
> interfaces. (It is desirable to be able to agree on a common
> higher-level interface too, but agreeing first on a low-level
> interface is an easier task to accomplish.) Such a high-level interface
> (not necessarily limited to speech) will make good use of the
> existing low-level interface.
>
> It is desirable that simple applications can use this API in a simple
> way. However, the API must also be rich enough that it does not limit
> more advanced applications in their use of the synthesizers.
>
> The first part (A) of this document describes the requirements,
> gathered from projects like Gnome Speech, Speech Dispatcher, KTTSD,
> Emacspeak and SpeakUp, of what they might reasonably expect from speech
> synthesis on a system. These requirements are not meant to be
> requirements on the synthesizers themselves, although they might be a
> guide to synthesizer authors as they plan future features and
> capabilities for their products. Parts (B) and (C) describe the
> XML/SSML markup in use and part (D) defines the interface.
>
> Temporary note: The goal of this interface is real implementation in
> the foreseeable future. The next step will be merging the available
> engine drivers in the various accessibility projects under this
> interface and using this interface. For this reason, we need all
> accessibility projects that want to participate in this common effort
> to make sure that all their requirements on a low-level speech output
> interface are met and that the interface is defined in a way that
> suits their needs.
>
> Temporary note: Any comments about this draft are welcome and
> useful. But since the goal of these requirements is real
> implementation, we need to avoid endless discussions and keep the
> comments focused and to the point.
>
> A. Requirements
>
> This section defines a set of requirements on the interface and on
> the speech synthesizer drivers needed to support assistive
> technologies on free software platforms.
>
> 1. Design Criteria
>
> The Common TTS Driver Interface requirements will be developed
> within the following broad design criteria:
>
> 1.1. Focus on supporting assistive technologies first. These
> assistive technologies can be written in any programming language
> and may provide specific support for particular environments such
> as KDE or GNOME.
>
> 1.2. Simple and specific requirements win out over complex and
> general requirements.
>
> 1.3. Use existing APIs and specs when possible.
>
>
> 1.4 All language-dependent functionality with respect to text
> processing for speech synthesis should be covered in the
> synthesizers or synthesis drivers, not in applications.
>
> 1.5. Requirements will be categorized in the following priority
> order: MUST HAVE, SHOULD HAVE, and NICE TO HAVE.
>
> The priorities have the following meanings with respect
> to the drivers available under this API:
>
> MUST HAVE: All drivers must satisfy this requirement.
>
> SHOULD HAVE: The driver will be usable without this feature, but
> it is expected that the feature will be implemented in all drivers
> intended for serious use.
>
> NICE TO HAVE: Optional features.
>
> Regardless of the priority, the full interface will be provided
> by the API, even when the given functionality is not actually
> implemented behind the interface.
>
> 1.6. Requirements outside the scope of this document will be
> labelled as OUTSIDE SCOPE.
>
> 1.7. An application must be able to determine if SHOULD HAVE
> and NICE TO HAVE features are supported for a given driver.
>
>
> 2. Synthesizer Discovery Requirements
>
> 2.1. MUST HAVE: An application will be able to discover all speech
> synthesizer drivers available to the machine.
>
> 2.2. MUST HAVE: An application will be able to discover all possible
> voices available for a particular speech synthesizer driver.
>
> 2.3. MUST HAVE: An application will be able to determine the
> supported languages, possibly also including a dialect or a
> country, for each voice available for a particular speech
> synthesizer driver.
>
> Rationale: Knowledge about available voices and languages is
> necessary to select the proper driver and to be able to select a
> supported language or different voices in an application.
>
> 2.4. MUST HAVE: Applications may assume their interaction with the
> speech synthesizer driver doesn't affect other operating system
> components in any unexpected way.
>
> 2.5. OUTSIDE SCOPE: Higher-level communication interfaces
> to the speech synthesizer drivers, and the exact form of the
> communication protocol (text protocol, IPC, etc.).
>
> Note: It is expected they will be implemented by particular
> projects (Gnome Speech, KTTSD, Speech Dispatcher) as wrappers
> around the low-level communication interface defined below.
>
>
> 3. Synthesizer Configuration Requirements
>
> 3.1. MUST HAVE: An application will be able to specify the default
> voice to use for a particular synthesizer, and will be able to
> change the default voice in between `speak' requests.
>
> 3.2. SHOULD HAVE: An application will be able to specify the default
> prosody and style elements for a voice. These elements will match
> those defined in the SSML specification, and the synthesizer may
> choose which attributes it wishes to support. Note that prosody,
> voice and style elements specified in SSML sent as a `speak' request
> will temporarily override the default values.
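>
> A minimal sketch of such a `speak' request, using the relative
> prosody values defined in part B (the particular values here are
> arbitrary):
>
>   <speak xml:lang="en">
>     <!-- These settings apply to this request only; the configured
>          defaults are restored for subsequent requests. -->
>     <prosody rate="+20%" volume="-10%">
>       This sentence is spoken faster and slightly more quietly.
>     </prosody>
>   </speak>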
>
> 3.3. SHOULD HAVE: An application should be able to provide the
> synthesizer with application-specific pronunciation lexicon
> addenda. Note that using the `phoneme' element in SSML is another way
> to accomplish this on a very localized basis, and it will override
> any pronunciation lexicon data for the synthesizer.
>
> Rationale: This feature is necessary so that the application is
> able to speak artificial words or words with explicitly modified
> pronunciation (e.g. "the word ... is often mispronounced as ...
> by foreign speakers").
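>
> A sketch of the localized `phoneme' approach mentioned above (the
> word and its IPA transcription are illustrative only):
>
>   <speak xml:lang="en">
>     The name
>     <phoneme alphabet="ipa" ph="ˈlɪnʊks">Linux</phoneme>
>     is often mispronounced.
>   </speak>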
>
> 3.4. MUST HAVE: Applications may assume they have their own local
> copy of a synthesizer and voice. That is, one application's
> configuration of a synthesizer or voice should not conflict with
> another application's configuration settings.
>
> 3.5. MUST HAVE: Changing the default voice or voice/prosody element
> attributes does not affect a `speak' in progress.
>
> 4. Synthesis Process Requirements
>
> 4.1. MUST HAVE: The speech synthesizer driver is able to process
> plain text (i.e. text that is not marked up via SSML) encoded in
> the UTF-8 character encoding.
>
> 4.2. MUST HAVE: The speech synthesizer driver is able to process
> text formatted using extended SSML markup defined in part B of
> this document and encoded in UTF-8. The synthesizer may choose
> to ignore markup it cannot handle or even to ignore all markup
> as long as it is able to process the text inside the markup.
>
> 4.3. SHOULD HAVE: The speech synthesizer driver is able to properly
> process the extended SSML markup defined in part B of this
> document as SHOULD HAVE. The same applies analogously to the
> NICE TO HAVE markup.
>
> 4.4. MUST HAVE: An application must be able to cancel a synthesis
> operation in progress. In the case of hardware synthesizers, or
> synthesizers that produce their own audio, this means cancelling
> the audio output as well.
>
> 4.5. MUST HAVE: The speech synthesizer driver must be able to
> process long input texts in such a way that the audio output
> starts to be available for playing as soon as possible. An
> application is not required to split long texts into smaller
> pieces.
>
> 4.6. SHOULD HAVE: The speech synthesizer driver should honor the
> Performance Guidelines described below.
>
> 4.7. NICE TO HAVE: It would be nice if a synthesizer were able to
> support "rewind" and "repeat" functionality for an utterance (see
> related descriptions in the MRCP specification).
>
> Rationale: This allows moving over long texts without the need to
> synthesize the whole text and without losing context.
>
> 4.8. NICE TO HAVE: It would be nice if a synthesizer were able to
> support multilingual utterances.
>
> 4.9. SHOULD HAVE: A synthesizer should support notification of
> `mark' elements, and the application should be able to align
> these events with the synthesized audio.
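>
> A sketch of `mark' usage: the driver should notify the application
> when synthesis reaches the named mark (the name is arbitrary):
>
>   <speak xml:lang="en">
>     First paragraph.
>     <mark name="para2"/>
>     Second paragraph.
>   </speak>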
>
> 4.10. NICE TO HAVE: It would be nice if a synthesizer supported
> "word started" and "word ended" events and allowed alignment of
> the events similar to that in 4.9.
>
> Rationale: This is useful for updating the cursor position as
> displayed text is spoken.
>
> 4.11. REMOVED (not directly important for accessibility)
>
> The former version: It would be nice if a synthesizer supported
> timing information at the phoneme level and allowed alignment of
> the events similar to that in 4.9. Rationale: This is useful
> for talking heads.
>
>
> 4.12. SHOULD HAVE: The application should be able to pause and resume
> a synthesis operation in progress while still being able to handle
> other synthesis requests in the meantime. In the case of hardware
> synthesizers, this means pausing and, if possible, resuming the
> audio output as well.
>
> 4.13. REMOVED (not clear purpose, the SSML specs do not require
> the 's' element to work this way)
>
> The synthesizer should not try to split the
> contents of the `s' SSML element into several independent pieces,
> unless required by a markup inside.
>
> Rationale: An application may have better information about the
> synthesized text and perform its own splitting of sentences.
>
> 4.14. OUTSIDE SCOPE: Message management (queueing, ordering,
> interleaving, etc.).
>
> 4.15. OUTSIDE SCOPE: Interfacing software synthesis with audio
> output.
>
> 4.16. OUTSIDE SCOPE: Specifying the audio format to be used by a
> synthesizer.
>
> 5. Performance Guidelines
>
> In order to make the speech synthesizer driver actually usable with
> assistive technologies, it must satisfy certain performance
> expectations. The following text gives driver implementors a rough
> idea of what is needed in practice.
>
> Typical scenarios when working with a speech-enabled text editor:
>
> 5.1. Typed characters are spoken (echoed).
>
> Reading the characters and cancelling the synthesis must be
> very fast, to keep up with a fast typist or even with
> autorepeat. Consider a typical autorepeat rate of 25 characters
> per second: ideally, within each of these 40 ms intervals,
> synthesis should begin, produce some audio output and stop.
> Performing all these actions within 100 ms (allowing for a fast
> typist and some overhead in the application and the audio output)
> on common hardware is very desirable.
>
> Appropriate character reading performance may be difficult to
> achieve with contemporary software speech synthesizers, so it may
> be necessary to use techniques like caching of the synthesized
> characters. Also, it is necessary to ensure there is no initial
> pause ("breathing in") within the synthesized character.
>
> 5.2. Moving over words or lines, each of which is spoken.
>
> The sound sample needn't be available as quickly as in the case of
> typed characters, but it should still be available without a clearly
> noticeable delay. As the user moves over the words or lines, they
> must hear the text immediately. Cancelling the synthesis of the
> previous word or line must be instant.
>
> 5.3. Reading a large text file.
>
> In such a case, it is not necessary to start speaking instantly,
> because reading a large text is not a very frequent operation.
> A one-second delay at the start is acceptable, although not
> comfortable. Cancelling the speech must still be instant.
>
>
> B. XML (extended SSML) Markup in Use
>
> This section defines the set of XML markup and special
> attribute values for use in input texts for the drivers.
> The markup consists of two namespaces: 'SSML' (default)
> and 'tts', where 'tts' introduces several new attributes
> to be used with the 'say-as' element and a new element
> 'style'.
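>
> A sketch of an input document using both namespaces. Note that this
> document does not define a URI for the 'tts' namespace, so the one
> used below is hypothetical:
>
>   <speak version="1.0"
>          xmlns="http://www.w3.org/2001/10/synthesis"
>          xmlns:tts="http://freedesktop.org/tts"
>          xml:lang="en">
>     <say-as interpret-as="tts:char">@</say-as>
>   </speak>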
>
> If an SSML element is supported, all attributes that the SSML 1.0
> definition makes mandatory for it must be supported, even if they
> are not explicitly mentioned in this document.
>
> This section also defines which functions the API
> needs to provide for default prosody, voice and style settings,
> according to (3.2).
>
> Note: According to available information, SSML is not known
> to suffer from any IP issues.
>
>
> B.1. SHOULD HAVE: The following elements are supported
> speak
> voice
> prosody
> say-as
>
> B.1.1. These SPEAK attributes are supported
> 1 (SHOULD HAVE): xml:lang
>
> B.1.2. These VOICE attributes are supported
> 1 (SHOULD HAVE): xml:lang
> 2 (SHOULD HAVE): name
> 3 (NICE TO HAVE): gender
> 4 (NICE TO HAVE): age
> 5 (NICE TO HAVE): variant
>
> B.1.3. These PROSODY attributes are supported
> 1 (SHOULD HAVE): pitch (with +/- %, "default")
> 2 (SHOULD HAVE): rate (with +/- %, "default")
> 3 (SHOULD HAVE): volume (with +/- %, "default")
> 4 (NICE TO HAVE): range (with +/- %, "default")
> 5 (NICE TO HAVE): 'pitch', 'rate', 'range'
> with absolute value parameters
>
> Note: The corresponding global relative prosody settings
> commands (not markup) in TTS API represent the percentage
> value as a percentage change with respect to the default
> value for the given voice and parameter, not with respect
> to previous settings.
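>
> A sketch combining the VOICE and PROSODY attributes above (the
> voice name 'klaus' is illustrative only):
>
>   <speak xml:lang="en">
>     <voice xml:lang="de" name="klaus">
>       <prosody pitch="-15%" rate="+10%">
>         Guten Tag.
>       </prosody>
>     </voice>
>   </speak>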
>
>
> B.1.4. The SAY-AS attribute 'interpret-as'
> is supported with the following values
>
> 1 (SHOULD HAVE) characters
> The format 'glyphs' is supported.
>
> Rationale: This provides capability for spelling.
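>
> A sketch of spelling out a word with the 'glyphs' format:
>
>   <speak xml:lang="en">
>     <say-as interpret-as="characters" format="glyphs">SSML</say-as>
>   </speak>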
>
> 2 (SHOULD HAVE) tts:char
> Indicates the content of the element is a single
> character and it should be pronounced as a character.
> The element's contents (CDATA) should only contain
> a single character.
>
> This is different from the interpret-as value "characters"
> described in B.1.4.1. While "characters" is intended
> for spelling words and sentences, "tts:char" means
> pronouncing the given character (which might be subject
> to different settings, such as using sound icons to
> represent symbols).
>
> If more than one character is present as the contents
> of the element, this is considered an error.
>
> Example:
> <speak>
> <say-as interpret-as="tts:char">@</say-as>
> </speak>
>
> Rationale: It is useful to have a separate attribute value
> for "single characters", as it can be used in TTS
> configuration to distinguish the situation where
> the user is moving the cursor over characters
> from the situation of spelling, as well as in other
> situations where the concept of a "single character"
> has some logical meaning.
>
> 3 (SHOULD HAVE) tts:key
> The content of the element should be interpreted
> as the name of a keyboard key or combination of keys. See
> section (C) for possible string values of content of this
> element. If a string is given which is not defined in section
> (C), the behavior of the synthesizer is undefined.
>
> Example:
> <speak>
> <say-as interpret-as="tts:key">shift_a</say-as>
> </speak>
>
> 4 (NICE TO HAVE) tts:digits
> Indicates that the content of the element is a number.
> The attribute "detail" is supported and can take a numerical
> value indicating how many digits the synthesizer should group
> for reading. The value 0 means the number should be
> pronounced as a whole, as appropriate for the language, while
> any non-zero value means that groups of that many digits should
> be formed for reading, starting from the left.
>
> Example: The string "5431721838" would normally be read
> as "five billion four hundred thirty-one million ..." but
> when enclosed in the above say-as with detail set to 3, it
> would be read as "five hundred forty-three, one hundred
> seventy-two etc.", or as "five, four, three, one etc." with
> detail 1.
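>
> Written as markup, the detail="3" case above would be:
>
>   <speak>
>     <say-as interpret-as="tts:digits" detail="3">5431721838</say-as>
>   </speak>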
>
> Note: This is an extension to SSML not defined in the
> format itself, introduced under the namespace 'tts' (as
> allowed in SSML 'say-as' specifications).
>
>
> B.2. NICE TO HAVE: The following elements are supported
> mark
> s
> p
> phoneme
> sub
>
> B.2.1. NICE TO HAVE: These P attributes are supported:
> 1 xml:lang
>
> B.2.2. NICE TO HAVE: These S attributes are supported
> 1 xml:lang
>
> B.3. SHOULD HAVE: An element `tts:style' (not defined in SSML 1.0)
> is supported.
>
> This element can occur anywhere inside the SSML document.
> It may contain all SSML elements except the element 'speak'
> and it may also contain the element 'tts:style'.
>
> It has two mandatory attributes, 'field'
> and 'mode', and an optional string attribute 'detail'. The
> attribute 'field' can take the following values,
> defined below:
> 1) punctuation
> 2) capital_letters
>
> If the attribute 'field' is set to 'punctuation',
> the 'mode' attribute can take the following values
> 1) none
> 2) all
> 3) (NICE TO HAVE) some
> When set to 'none', no punctuation characters are explicitly
> indicated. When set to 'all', all punctuation characters
> in the text should be indicated by the synthesizer. When
> set to 'some', the synthesizer will pronounce the
> punctuation characters enumerated in the additional attribute
> 'detail', or, if no 'detail' attribute is specified, will only
> speak those characters according to its settings.
>
> The attribute 'detail' takes the form of a string containing
> the punctuation characters to read.
>
> Example:
> <tts:style field="punctuation" mode="some" detail=".?!">
>
> If the attribute 'field' is set to 'capital_letters',
> the 'mode' attribute can take the following values
> 1) no
> 2) spelling
> 3) (NICE TO HAVE) icon
> 4) (NICE TO HAVE) pitch
>
> When set to 'no', capital letters are not explicitly
> indicated. When set to 'spelling', capital letters are
> spelled out (e.g. "capital a"). When set to 'icon', a sound
> is inserted before the capital letter, possibly leaving
> the letter/word/sentence intact. When set to 'pitch',
> the capital letter is pronounced with a higher pitch,
> possibly leaving the letter/word/sentence intact.
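>
> A sketch of the 'capital_letters' style (the enclosing 'speak'
> element and the text are illustrative):
>
>   <speak xml:lang="en">
>     <tts:style field="capital_letters" mode="spelling">
>       KDE is spelled with three capital letters.
>     </tts:style>
>   </speak>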
>
>
> Rationale: These are basic capabilities well established
> in accessibility; however, SSML does not support them.
> Introducing this additional element does not prevent
> outside applications from sending valid SSML
> to the TTS API.
>
> B.4. NICE TO HAVE: Support for the rest of the elements and attributes
> defined in SSML 1.0. However, this is of lower priority than
> the enumerated subset above.
>
> Open Issue: In many situations, it will be desirable to
> preserve whitespace characters in the incoming document.
> Should we require the application to use the 'xml:space'
> attribute on the 'speak' element, or should we state that
> 'preserve' is the default value of 'xml:space' in the root
> 'speak' element in this case?
>
> C. Key names
>
> A key name may contain any character excluding control characters
> (the characters in the range 0 to 31 of the ASCII table and other
> ``invisible'' characters), spaces, dashes and underscores.
>
> C.1 The recognized key names are:
> 1) Any single UTF-8 character, excluding the exceptions defined
> above.
>
> 2) Any of the symbolic key names defined below.
>
> 3) A combination of key names defined below, using the
> '_' (underscore) character for concatenation.
>
> Examples of valid key names:
> A
> shift_a
> shift_A
> $
> enter
> shift_kp-enter
> control
> control_alt_delete
>
> C.2 List of symbolic key names
>
> C.2.1 Escaped keys
> space
> underscore
> dash
>
> C.2.2 Auxiliary Keys
> alt
> control
> hyper
> meta
> shift
> super
>
> C.2.3 Control Character Keys
> backspace
> break
> delete
> down
> end
> enter
> escape
> f1
> f2 ... f24
> home
> insert
> kp-*
> kp-+
> kp--
> kp-.
> kp-/
> kp-0
> kp-1 ... kp-9
> kp-enter
> left
> menu
> next
> num-lock
> pause
> print
> prior
> return
> right
> scroll-lock
> space
> tab
> up
> window
>
> D. Interface Description
>
> This section defines the low-level TTS driver interface for use by
> all assistive technologies on free software platforms.
>
> 1. Speech Synthesis Driver Discovery
>
> ...
>
> 2. Speech Synthesis Driver Interface
>
> ...
>
> Open Issue: There is still no clear consensus on how to return the
> synthesized audio data (if at all). The main issue here is
> how to align marker and other time-related events
> with the audio being played on the audio output device.
>
> Proposal: There will be two possible ways to do it. The synthesized
> data can be returned to the application (case A), or the
> application can ask for it to be played on the audio output
> (which will not be the task of the TTS API, but will be handled
> by another API) (case B).
>
> In (case A), each time the application gets a piece of audio
> data, it also gets a time-table of the index marks and events
> in that piece of data. This will be done on a separate socket
> in asynchronous mode. (This is only possible for software
> synthesizers, however.)
>
> In (case B), the application will get asynchronous callbacks
> (they might be realized by sending a defined string over
> a socket, by calling a callback function or in some other
> way -- the particular way of doing it is considered an
> implementation detail).
>
> Rationale: Both approaches are useful in different situations
> and each of them provides some capability that the other one
> doesn't.
>
> Open Issue: Will the interaction with the driver be synchronous
> or asynchronous? For example, will a call to `speak'
> wait to return until all the audio has been processed? If
> not, what happens when a call to `speak' is made while the
> synthesizer is still processing a prior call to `speak'?
>
> Proposal: With the exception of event and index mark signalling,
> the communication will be synchronous. When a speak request
> is issued while the driver is still processing a prior call to
> speak and the application has not called pause before, this is
> considered an error.
>
> E. Related Specifications
>
> SSML: http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
> (see requirements at the following URL:
>
> http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/#ref-reqs)
>
> SSML 'say-as' element attribute values:
> http://www.w3.org/TR/2005/NOTE-ssml-sayas-20050526/
>
> MRCP: http://www.ietf.org/html.charters/speechsc-charter.html
>
> F. Copying This Document
>
> Copyright (C) 2006 ...
> This specification is made available under a BSD-style license ...
>
> _______________________________________________
> accessibility mailing list
> accessibility at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/accessibility
--
Janina Sajka Phone: +1.240.715.1272
Partner, Capital Accessibility LLC http://www.CapitalAccessibility.Com
Marketing the Owasys 22C talking screenless cell phone in the U.S. and Canada--Go to http://www.ScreenlessPhone.Com to learn more.
Chair, Accessibility Workgroup Free Standards Group (FSG)
janina at freestandards.org http://a11y.org