[Accessibility] Fwd: TTS API interface description + reqs update

Sat Apr 29 05:46:56 PDT 2006

Hello,

I'm forwarding an email from accessibility at freedesktop.org, in
which I also give brief minutes of the TTS API group meeting on Monday
the 3rd of April.

Please reply on the Freedesktop mailing list, which is the agreed
official place for work on these specs. I'd like to keep
the comments in one place so that people are not confused about
what was posted where and also so that we have an archive.

Thank you,
Hynek Hanke

Hello,

I've updated the requirements document with the latest suggestions
and tried to incorporate what we discussed at one FSG meeting and
at a subsequent meeting of a subgroup specifically about the TTS API.
Very brief meeting notes and changes are described below.

I've also written a draft of an interface definition that fits
these requirements. Since it was no longer possible to keep
the document in plain text, given the lack of structure and references
and the growing size of the document, I've converted it into Texinfo.
I've chosen Texinfo for ease of editing, but I've paid special
attention to marking things up properly so that it will be possible
to convert it into DocBook in the future if there is such
a wish.

A simple project page with the document is now available here:
	http://www.freebsoft.org/tts-api
It also describes how to obtain the source from anonymous CVS.
If someone wants to have a CVS account there to be able to
work on the document, please email me privately.

You can find the document in various formats, automatically
generated from CVS, here:
	http://www.freebsoft.org/doc/tts-api/

The main discussion and work is to be continued on the Freedesktop
mailing list.

Changes + additions
===================

* A BSD-style license.

* A draft of the interface description was written.

* Notes about the draft, explaining how some capabilities
can be achieved, were also written.

* The requirements on the API are now in Appendix A. Appendix D was
added with requirements on the synthesizers themselves (not the drivers).
It is a guideline for synthesizer authors on what is really
needed from the synthesizer itself.

FSG subgroup teleconference about TTS API
(more changes and additions)
-----------------------------------------

Participants: Janina Sajka, Willie Walker, Olaf Schmidt,
Gary Cramblitt, George Kraft, Kirk Reiser, Milan Zamazal,
Jan Buchal, Hynek Hanke

* A concern was raised by Janina Sajka that
point A.2.4 of the requirements is not specific enough and that it should
include not blocking the audio device as an example. Willie Walker
suggested changing it to "inappropriate blocking of system
resources". I made both changes.

* Willie Walker requested that A.3.2 of the requirements state more
clearly that settings provided by SSML will also be provided by
explicit functions in the API. I tried to change A.3.2 in this way.

* We discussed point A.4.5 of the requirements, which says that
applications should not be forced to split long text into smaller chunks
for performance reasons. We concluded that this probably fits better into
the Performance Guidelines. I've made this change too and moved it to A.5.3,
modifying A.5.4. The point remains, however, that the synthesis driver
only honors the performance guidelines if the performance described is
achieved without the need for the application to split longer text into
smaller chunks.

* We also discussed whether we should give some indications, perhaps
concrete numbers, of what constitutes a long text etc. Milan Zamazal,
however, pointed out that this is not easy, as the time to process a text
inside a synthesizer depends not only on the size of the text, but also
on what it contains. A relatively short text containing tens of
characters might under some circumstances be much more difficult to
process than a text of several paragraphs. For this reason, it
would be difficult to give exact figures.

* We found a minor omission: A.3.5 did not mention the style
element settings.

* We discussed point A.4.7. On one hand, as pointed out by Gary Cramblitt
and Olaf Schmidt, there is a need for applications
to be able to repeat and rewind by words
and sentences without doing their own sentence boundary detection.
I also raised a concern about rewinding and repeating
long messages: if these capabilities are not somehow supported
inside the synthesizer, performance may be unnecessarily bad,
because the application needs to send the whole text again and wait
for the full audio up to the place of rewinding to be synthesized.
For long text, the application will typically not have the full
audio for the whole message available in advance.

On the other hand, a concern was raised (George Kraft or Kirk Reiser)
that we should not design a complicated API solely because of the
limitations of current hardware and that this functionality could
probably be implemented somehow on top of events and index marking.
That would remove the need for the synthesizer to do complicated
processing of several messages at once, pausing etc.

After thinking about it and after more private discussions with
Gary Cramblitt, I proposed a solution in the interface description
which relies only on event callbacks and a capability of the synthesizer
to start at a given place in the sent text. Further, for synthesizers
that want to spare applications from sending the whole text
again and repeating the whole SSML parsing and syntax analysis for
each rewind request, I've introduced a defer() function into the
API, through which a more advanced synthesizer that supports processing
several messages can be asked to keep the useful data for a given
message for later processing.

Thus the result is that synthesizers would, under the current version,
not have to provide explicit repeat, rewind and context pause
functions. If they do not want to, they don't even need to be able
to handle several messages at once; they can work completely
synchronously, and the requirement points are still satisfied without
any change (in the sense that it is possible for the application to
achieve this functionality). Please see also a note in Chapter 3.

* We decided that points A.4.9-A.4.10, which cover events and index
marks, should be reworked as they were not clear enough. Willie
Walker suggested that apart from words, there should also
be sentence events. Please see their current version for more details.

* The group internally reached agreement that the capability to
report "message started" and "message ended" events is a MUST HAVE
for the synthesizer to be useful. I changed this to MUST HAVE in
A.4.9. Other events and index marks are NICE TO HAVE. Please protest
if you do not agree.

* Gunnar Schmidt proposed that it should be possible to pause a message
either immediately or at a sentence boundary (again without the need
for the application to do sentence boundary detection). This is possible
with the proposed mechanism.

* We again discussed events and index marks in general, as well as more
granular events such as phoneme-level events. Milan Zamazal and Olaf
Schmidt gave examples where the application can't rely on index
marks being reached in the order they were inserted. For example,
"$100" might be spoken as "one hundred dollars" (the dollar sign
comes before the number in the text, but the word "dollars" comes
after it in speech). I clarified that phoneme-level index
marks and events are currently left out because they are of less
immediate importance and difficult to handle. The current mechanism in
the API draft reports both the position of events in the audio and their
position in the original text (in number of characters from the beginning).

* Janina Sajka requested clarification on whether the current API
addresses the need to have several synthesizers running at the same
time and to switch between them as the user works on the machine,
which will be more and more common in the future. We reached agreement
that this API indeed takes this into account by allowing multiple
drivers to be loaded.

* Also, Janina Sajka proposed getting in contact with the W3C about our
extensions to SSML, to see whether some other mechanism is proposed or
whether the extensions could eventually be incorporated into the original
SSML specs in some form. I'm going to contact the W3C committee about this.

* We discussed our further work on the API and concluded that
we more or less have agreement on the requirements and that, after the
API description is finished, we can start implementation of the
intermediate layer and of the drivers for it. Sadly, we had only
a little time left, but after some later clarifications over email,
it is possible to say that Brailcom is currently willing to work
on it and we can hope for help from KDE. Gnome/Sun supports the
effort and wants to participate, but currently has its hands full with
work on other very important parts of accessibility (AT-SPI, Orca etc.)
and so can only contribute to the development of the specs,
not to implementation.
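
To illustrate the event-based rewinding discussed under A.4.7 above,
here is a minimal sketch in Python. It is not taken from the interface
draft. Only the name defer() comes from the text above; its signature
here, and every other name (ToySynthesizer, say_deferred, Event,
sentence_starts, Application), are hypothetical. The toy "synthesizer"
produces no audio and works synchronously; it only emits events that
carry a position in the original text, counted in characters from the
beginning. The audio position that the draft also reports is omitted.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str      # "message_start", "sentence_start" or "message_end"
    text_pos: int  # offset in characters from the beginning of the text


def sentence_starts(text):
    """Very naive sentence boundary detection (illustration only)."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch in ".!?":
            j = i + 1
            while j < len(text) and text[j].isspace():
                j += 1
            if j < len(text) and j != starts[-1]:
                starts.append(j)
    return starts


class ToySynthesizer:
    """Hypothetical stand-in for a driver; emits events, no audio."""

    def __init__(self, on_event):
        self.on_event = on_event
        self.parse_count = 0  # how often a text had to be analysed
        self._deferred = {}   # message id -> (text, sentence starts)

    def defer(self, msg_id, text):
        # Assumed signature: analyse once, keep the result for later.
        self.parse_count += 1
        self._deferred[msg_id] = (text, sentence_starts(text))

    def say_deferred(self, msg_id, start_pos=0):
        # Speak a deferred message, starting at a given text position.
        text, starts = self._deferred[msg_id]
        self.on_event(Event("message_start", start_pos))
        for pos in starts:
            if pos >= start_pos:
                self.on_event(Event("sentence_start", pos))
        self.on_event(Event("message_end", len(text)))


class Application:
    """Rewinds by sentence using only the events it has received."""

    def __init__(self, text):
        self.sentences_seen = []
        self.synth = ToySynthesizer(self.handle_event)
        self.synth.defer("msg", text)

    def handle_event(self, event):
        if event.kind == "sentence_start":
            self.sentences_seen.append(event.text_pos)

    def speak(self):
        self.synth.say_deferred("msg")

    def rewind_one_sentence(self):
        # Go back to the sentence before the last one reported.
        seen = self.sentences_seen
        target = seen[-2] if len(seen) >= 2 else 0
        self.sentences_seen = [p for p in seen if p < target]
        self.synth.say_deferred("msg", start_pos=target)
        return target


if __name__ == "__main__":
    app = Application("One. Two. Three.")
    app.speak()
    print("restarted at character", app.rewind_one_sentence())
```

In this sketch the application never detects sentence boundaries
itself; it only remembers the positions reported in sentence events and
asks for synthesis to start again from one of them. Because the message
was deferred, the text is analysed once, however often the user rewinds.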

I hope I covered all the important points discussed. If not, please
correct me. For more detailed information, the participants can access
the audio recording of the teleconference:
	http://rednote.net/fsg/tts/2006apr03.ogg

I'll be thankful for any comments about the new interface draft
and about the updated requirements.

With regards,
Hynek Hanke