[Accessibility] Accessibility Conference January 27, 2005 - Speech server

john goldthwaite jgoldthwaite at yahoo.com
Wed Feb 16 14:54:13 PST 2005

Accessibility Conference January 27, 2005 - Speech

Marc Mulcahy
Doing this session in discussion format.  The agenda
has three sessions: API for TTS, Interfacing with
Audio Servers, and Speech audio in the console and
terminal environment.  Will start with the latter,
which we skipped yesterday to give more time for the
AT-SPI discussion.

How do we prioritize audio when we have multiple
sources: tts, a VoIP phone, messages?  When I get an
internet phone call, maybe I want it to mute
everything, or just tts.  How do we define basic
requirements?  What do we need from the console for
speech?

Janina- we need to be able to define our tts as a
privileged audio source so that it will always speak
if we want that.  Low latency is very important to
audio: we need a quick start and a quick shut-up.  It
matters more than whether your BBC audio stream starts
late.

Larry- to augment that, tts should be considered a
special class, intrinsically different from other
audio.  Also, it would be useful if we can define
"quick".  Define whether you should get it all before
starting to talk.  Third, tts users must be able to
turn intelligible speech up to high rates; define what
"fast" is.  Also would like multiple languages and a
set of voice characteristics, so we have distinct
voices for identification of different-purpose voice
messages.

Hynek- let's not mix tts with the audio service.  I
was doing some experiments; low latency means 20-40
ms.

Frank- we need much faster: 1.3 milliseconds for
speech shut-up time.  1.3 milliseconds is the fastest
we can do in the kernel, and there is no reason to go
lower.  That is 64 frames per buffer.  As far as
start-up, it is not that simple, since processing is
involved.
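
Frank's figure can be checked with simple arithmetic: the latency contributed by one audio buffer is its frame count divided by the sample rate. A minimal sketch (the 48 kHz rate is an assumption, since the sample rate was not stated in the discussion):

```python
def buffer_latency_ms(frames_per_buffer: int, sample_rate_hz: int) -> float:
    """Worst-case latency added by one audio buffer, in milliseconds."""
    return frames_per_buffer / sample_rate_hz * 1000.0

# 64 frames at an assumed 48 kHz gives roughly the 1.3 ms figure quoted above.
print(round(buffer_latency_ms(64, 48000), 2))  # 1.33
```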

Will- in terms of start-up, Bell research said latency
of more than 100 milliseconds makes conversation
difficult.  What kinds of audio classes should we
have?  E.g., start, and restart after another message;
we need to enumerate them.  As for speech always being
the priority, there are cases where you want to hear
system prompts at the same time.

Janina- flexibility and user configurability are what
we want.  Yes, mixing streams may be what is wanted.
Do we mix or do we queue?  It should be up to the user
to decide whether they mix or not.
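
The mix-versus-queue choice could be modeled as a per-class, user-editable policy table. A minimal sketch; the class names and actions here are illustrative, not from any existing audio server:

```python
# User-configurable policy: what a new stream of a given class does to
# audio that is already playing.  All names here are hypothetical.
POLICY = {
    "tts": "mix",      # speech is privileged and always gets through
    "system": "mix",   # system prompts may play alongside speech
    "media": "queue",  # background media waits (the user could set "mix")
}

def action_for(stream_class: str, policy=POLICY) -> str:
    """Look up the user's chosen behavior, defaulting to queueing."""
    return policy.get(stream_class, "queue")

print(action_for("tts"), action_for("podcast"))  # mix queue
```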

Mark- what we are talking about is predicated on being
able to mix the streams.  Where do hardware
synthesizers fit into this?  While they are not part
of the audio, you may want to synchronize the tts and
the audio, e.g., reading a web page with a hardware
synth.

Hynek- synchronizing with hardware synthesis can be
difficult.  I looked at both: hardware synthesizers
are useful during boot-up, but there seems to be no
need for them after that.  They are very expensive,
lack some advanced speech features (they don't support
sound signals), and are not extensible.  They are not
running free software, so it's not possible to
synchronize.  Sometimes they support languages that
are not supported in software.  Latency is about the
same for hardware and software synthesis; latency in
Festival is mainly due to network issues.  No reason
to put too much work into them.

Janina- there may be problems with hardware
synthesizers, but there are also problems with
software synthesizers.  There are few open software
speech synthesizers.  We've got to get the
requirements.

Mark- we have about 20 minutes, and I would like to go
back to the topic of audio servers.  We have less than
perfect options: esound, libALSA, JACK, ARTS, MAS, and
OSS (which doesn't mix and is deprecated in Linux).
It sounds like people are leaning toward libALSA.
What do we get and not get with libALSA?  And with
JACK?

Frank- JACK isn't a place for mixing, but it can pass
audio on.  Usually audio applications like JAMin are
clients to the server; other things, like snd, as
well.  JACK can mix all these things together.  It
works fine if it's the first device; if you have a
separate audio source, it isn't easy to write.
Documentation is currently poor; it has only recently
stabilized.  There are multiple things to manage, some
of which are audio streams.  When we are talking about
a desktop, you are talking about many levels.

Janina- key issue: if you ask for an audio stream, you
should get it.  You should not get a message that it
can't be done because another process owns the audio
device.  (esound using OSS blocks the device.)

Hynek- I don't think we will be happy with any of the
audio servers or streamers that exist.  We need to
look at the problem as a whole.  One possibility: one
widely accepted audio server that has an interface
people are happy with.  Put effort into developing
this server.

Second, short of that, we could have separate audio
servers in Gnome and KDE which cooperate with each
other.  Gnome is developing GStreamer and they are
happy with it.  KDE is using ARTS but is looking for
another server for KDE 4.0.  The most important thing
is to put down the requirements for accessibility and
try to get help from the wider Gnome and KDE
communities.  For sighted users it is not obvious that
the two servers need to work together.

Frank- I like the idea, but it's not universal enough.
What if we have another desktop in a year?

Hynek- we all know that Gnome and KDE are the
important projects that have to agree; then the
smaller projects will join.  Simple audio output
doesn't work currently either.  Still using openosb
and the audio dispatcher because of it.

Frank- I'm not familiar with how you write to
/dev/dsp.  The OSS layer can't be opened for multiple
copies of /dev/dsp.  Why can't you do multiple opens?

Hynek- the problem we are trying to solve is broader
than multi-stream audio.  If I get a stream from auk,
... opening.  A solution for now, but not a generic
one.

Milan- as a sighted user, I would be happy if there
were only one audio server.  It is more complex than
that.

Mark- I would like to go over the list of requirements
that we came up with.

Frank- I agree with Hynek; we are not looking for a
stop-gap measure.  A media controlling system is more
complicated because it needs to manage events.  It has
to know how to prioritize the messages, e.g. whether a
"mail has arrived" message should be spoken or not.
It needs to be more intelligent about the identity and
purpose of the audio streams.

Mark- we decided that none of the current solutions
meet our needs.  We want to push our requirements
upstream to the community.  FSG may say that ALSA is
the most accessible and recommend that people use
libALSA.  Currently we don't know what to tell people.

- mixing audio
- tts as a privileged source
- low latency start and stop
- flexibility in configuration
- more information about sound card capabilities
- ALSA compatibility
- real-time capabilities

Hynek- I also think real-time audio is important.  It
is important to know when a sound ended, which matters
for serialized output.  Also being compatible with the
low-level sound architectures (OSS and ALSA
currently).

Frank- interesting point, though real time is not the
way I'd describe it.  We need more than the end time:
indexing is required, so that the audio server knows,
and can pass back to the application, the current
location in the tts speech.
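
Frank's indexing requirement, where the server reports the current position within the synthesized text, could look roughly like this. A sketch only; the class and method names are invented for illustration:

```python
class MarkerTracker:
    """Map the number of samples played so far back to a text position."""

    def __init__(self):
        self._markers = []  # (sample_offset, text_index), in playback order

    def add_marker(self, sample_offset: int, text_index: int) -> None:
        self._markers.append((sample_offset, text_index))

    def position(self, samples_played: int):
        """Return the last text index whose audio has started playing."""
        current = None
        for offset, index in self._markers:
            if offset <= samples_played:
                current = index
        return current

t = MarkerTracker()
t.add_marker(0, 0)       # sentence 1 starts at sample 0
t.add_marker(4800, 42)   # sentence 2 starts 0.1 s in (at 48 kHz)
print(t.position(5000))  # 42
```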

Janina- we also need to talk about media for the
audio-disabled.  These streams also need to support
captioning and in-frame signing.  We want to be able
to tweak equalization for the hard of hearing, not
just volume.

Larry- index markers are not sufficient; there needs
to be a mechanism for messages to be sent back to the
originating application.  We want location, voice
change, etc.

Olaf- (reading a comment from IRC) support for the WAV
format; there have been problems with it.

Present:
Gary Cramblitt on IRC
Pete Brunet on IRC
Kirk Reiser on phone
Marco (SuSE)
Janina Sajka
Ed Price
Mike Paciello
Will Walker
Larry Weiss
Sandy Gabrielli
Hynek and 3 Czech folks

Interfacing with Audio Servers

Action items for this break-out session:

* Identify scenarios for discussion of audio servers
and their capabilities
* Define requirements based on these scenarios
* Make a list of the currently available audio servers
* Make recommendations on which currently available
audio server option best meets our needs

Some topics worth considering:

* How does the issue of latency enter into discussions
surrounding speech and audio servers?
* How important is network transparency?
* What underlying sound systems must be supported
(OSS, ALSA, others...)?
* What dependencies/requirements are we willing to
live with?
* How well are current servers supported by standard
audio applications?


API for TTS

Action items for this break-out session:

* Identify current TTS API solutions
* Identify features, strengths, and weaknesses of the
various approaches
* Recommend extension of a current, or definition and
implementation of a new TTS API solution

Some questions for discussion:

* What speech engines should be supported in an
initial release?
* What dependencies are we willing to live with?
* What markup language should be used if any?
* What speech engine(s) will be used for acceptance
testing?
* What speech parameters should be modifiable?
* How can current speech engine status / capabilities
be queried?
* What amount of control over the audio device should
be provided?
* What speech-related callbacks and notifications are
needed?
* What programming languages should be supported?

As the author of gnome-speech, I know it has lots of
flaws.  I want to talk about what we need rather than
what we currently have.

Speech Dispatcher from Brailcom
gnome-speech
Speakup
Emacspeak


Gary- should we limit the discussion to requirements
for synthesizers, rather than for tts as a whole?

Mark- I want to enumerate the current synthesizers and
go over the strengths and weaknesses of each.  Also:
what synthesizers will be used for acceptance testing?

Do we want to recommend an API that can work with
multiple engines? 

Janina- we may want to get Olaf to talk about the
freedesktop project.

Olaf- On the freedesktop.org discussion list we have a
fairly complete list of requirements.  I have been in
contact with the brass maintainer about a common
format.  Also brought in Speech Dispatcher.  We don't
have feedback from x, but I think they will support
it.  We are working toward an API that can work with
dynamically loaded libraries, with no dependencies.
The KDE one is a daemon.  We have looked into speech
markup.

Will- thanks for bringing up the requirements.  Make
an API that meets our needs.
Gary- there is a need for a high-level protocol;
queuing is for higher-level apps.
Kirk- from a developer's perspective the most
important thing is an API.  It doesn't matter what it
is, just that it's consistent.  I'm not sure if you
are talking about having separate APIs for console and
GUI, or whether you want one API for everybody.

Mark- maybe Olaf should enumerate the points that
haven't been discussed.

Olaf- we generally had the idea of passing speech
markup to the speech engine and having an API.  An app
must be able to discover the speech devices and
languages.

Applications must be able to know that their
interactions with the speech system won't affect the
OS or other apps in some unexpected way.

We didn't define an application interface for a single
application to use the synthesizer.  KDE and Gnome
have different models; we might have a discussion on
those later.

It must be possible to set default parameters and set
a pronunciation lexicon.  Changing the default voice
shouldn't affect the voice in other applications.

Frank- we need an interface where we can put a wrapper
around some of the tts devices we have.  Not every tts
and screen reader needs to support everything, but
they need to be able to talk to each other so the
screen reader can discover what is available.
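
Frank's discovery point could be served by each wrapped driver exporting a small capability record a screen reader can query. A hypothetical sketch; the driver names and fields below are illustrative, not real feature sets:

```python
# Hypothetical capability records a screen reader could query; nothing here
# reflects any real driver's actual feature set.
DRIVERS = [
    {"name": "festival", "languages": ["en", "cs"], "markers": False},
    {"name": "dectalk",  "languages": ["en"],       "markers": True},
]

def find_driver(language: str, need_markers: bool = False):
    """Return the first driver matching the requested capabilities."""
    for drv in DRIVERS:
        if language in drv["languages"] and (drv["markers"] or not need_markers):
            return drv["name"]
    return None

print(find_driver("cs"), find_driver("en", need_markers=True))  # festival dectalk
```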

Will- we separated the list into Must Have, Might
Have, and Nice to Have.

Olaf-
- Transfer all the parameter information via XML; we
might have to enhance it later.
- There must be a way to probe the features of the
synthesizer.
- Use SSML markup.
- Must be able to clean up after being stopped.
- Synchronous or asynchronous operation.
- If you have long text, there must be a way to return
some audio earlier than the rest so it can be played.
- Good performance.
- Synthesizer features exposed in the API, like
multilingual support.
- Markers in the text; can't require them for all
synthesizers.
- Keep track of the location in stream A if
interrupted by stream B.
- Don't handle message processing.
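
The SSML and marker points above can be illustrated with a small fragment: the mark element is what would let an engine report progress back, and voice elements cover the multilingual case. The fragment is illustrative, not taken from any specific engine:

```python
import xml.etree.ElementTree as ET

# Minimal SSML-style fragment: a voice change for a second language and an
# index marker an engine could echo back as the audio reaches it.
ssml = """<speak version="1.0">
  <voice xml:lang="en">Reading the page.</voice>
  <mark name="para2"/>
  <voice xml:lang="cs">Druhy odstavec.</voice>
</speak>"""

root = ET.fromstring(ssml)
print(root.find("mark").get("name"))  # para2
print(len(root.findall("voice")))     # 2
```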

Open issues:
- How to return the audio data, and how to line up the
markers with what is being played.  If using a C
stream, would we have to have file descriptors in the
streams?
- Audio formats: we can't allow all formats; we must
select a reasonable subset.
- What happens if a call for speech is made while
another call is being processed?

Al- has anyone looked at the W3C dynamic properties
framework?  It is about sniffing to see what devices
are on my system.  The spec is moving out of the
multimedia group and being transferred into the device
independence group.  They would like to hear what your
requirements are.

Janina- what about multi-language support?  We want to
assign a different voice, and maybe a different speed,
for a second language.  You need some markup to point
this out; you can't do it from UTF alone.  You need to
put a marker there and think about it.

Olaf- we are doing voice switching in KDE tts; it is
also being done in Speech Dispatcher.  It just uses a
spell-check-like mechanism to detect the language.
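
The spell-check-like detection Olaf mentions amounts to scoring text against per-language word lists. A toy sketch; the word lists here are made up:

```python
# Toy language guesser in the spirit of a spell-checker: count how many
# words of the text appear in each language's (made-up) word list.
WORDLISTS = {
    "en": {"the", "and", "reading", "page", "is"},
    "cs": {"a", "je", "strana", "text", "druhy"},
}

def guess_language(text: str) -> str:
    words = set(text.lower().split())
    return max(WORDLISTS, key=lambda lang: len(words & WORDLISTS[lang]))

print(guess_language("reading the page"))  # en
```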

Larry- the sender of the text should be able to
specify what they are sending.  This allows the
application to try to be smart about what to do.  In
Home Page Reader we allow the sender to change the
language.  Word definitions might change based on the
language.

Mark- I am concerned about two things mentioned.  One
is punting on the issue of how speech gets to the
engine; we haven't solved the interop problem.  If ...
we have two instances of the engine.  You want to be
able to control what happens with the audio.  If x,
the engine will send it to xxx.

Larry- part of the question should be handled by the
audio server.  If we have classes of audio, we should
be able to use audio server policy rules to decide
what happens.

Pete- we lost some of your stream some time ago.

Olaf- about the different speech frameworks: I don't
see a way to unify gnome-speech and kdess; on KDE 4 we
may want to go with DBUS, which ...
What can be done is to have a speech driver for kdesp;
it loads the speech driver, and x does as well,
communicating through a local framework, and there is
one service that does the real speaking.
Mark- that may work, but if you get more than one
layer in a chain, it gets almost impossible to get the
callbacks right.  The more layers we add, the worse it
gets.  Maybe we should create a DBUS-based speech xx
and write an API for it.
Olaf- I don't think there would be much work to add
DBUS calls to the xxx.  If Speech Dispatcher used the
same DBUS calls, it would handle it.  Do we want to go
with the least common denominator of the three
projects?

Hynek- we need to think of one more thing, a general
problem in all accessibility work: there are not
enough resources.  I am in support of going for the
minimum set that we all support.  We should all be
prepared to give up some of our features.  It is a
good approach.

Milan- a minimum set may not be a satisfactory
solution for knowing what to talk to.

Kirk- if looking at APIs, would you also want to look
at a common format for streaming text?

Gary- agrees with Larry as a longer-term goal, but
what is needed is a better synthesizer.
Will- that is out of our control.
Olaf- currently we have had to use a closed system; we
had to buy 500 copies.
Gary- we might need to tell the synth vendors what our
minimum requirements are.  It would be great if they
would support the API directly.
Milan- working on Czech Festival.
Jeff?- working on something.

