From Proc.: Second Nordic Conference on Multimodal Communication, Gothenburg 2005


The Philosophy behind a (Danish) Voice-controlled Interface to Internet Browsing for motor-handicapped

Tom Brøndsted

Dep. of Communication Technology, Aalborg University

Abstract: The publicly funded project "Indtal" ("Speak-it") has succeeded in developing a Danish voice-controlled utility for internet browsing, targeting motor-handicapped users who have difficulties using a standard keyboard and/or a standard mouse. The system is governed by a number of a priori defined design criteria: learnability and memorability rather than naturalness, minimal need for maintenance after release, support for "all" web standards (not just HTML conforming to certain "recommendations"), independence of the language on the websites being browsed, allowance for multimodal control along with the unimodal oral mode, etc. These criteria have led to a primarily message-driven system interacting with an existing browser on the end users' systems.

Keywords: Alternative web-browsing, voice-controlled applications, e-inclusion, ubiquitous speech processing, accessibility for disabled persons.

The project Indtal ("Speak-it") has aimed at developing a Danish voice-controlled tool for internet browsing, targeting motor-handicapped users who have difficulties using a standard keyboard and mouse. The project was funded by the Danish National IT and Telecom Agency under the Ministry of Science and ran from the beginning of January 2004 to the end of February 2005. The project partners were the Department of Communication Technology at Aalborg University, in charge of the actual speech recognition technology, and the software company Efaktum in Hjørring, which implemented the back-end communication between the recogniser and the browser engine of the system. Furthermore, the project involved two non-technical partners, "Specialskolen for Voksne, Vendsyssel" and "Teknologicentret for Handicappede, Nordjyllands Amt", institutions located in North Jutland that specialize in compensatory courses and consultancy for adults with special needs, including adults with physical disabilities. These non-technical partners have been in charge of project management, contact with an advisory board of potential end-users, and the maintenance of the website www.indtal.dk, where disabled users can download the browser for free.

This paper consists of two parts: 1) The first part outlines five central design criteria characterizing the Indtal-browser as opposed to other alternative browsers addressing disabled users. 2) The second part describes the recognition front-end developed for the system. The first part is to a large extent a summary of a paper submitted to Interspeech 2005.

1. Design & Implementation Criteria

The Indtal-browser differs from other "alternative web browsers" by a number of design and implementation criteria. We define an alternative web browser as a browser offering an alternative to either standard visual output rendering or to standard keyboard and mouse input control - or both. A number of such alternative browsers are listed at W3C's website hosting the Web Accessibility Initiative (WAI) (W3C 2005) and at similar websites devoted to disabled users. Roughly, we distinguish two major groups, both of which can deploy speech recognition and/or speech synthesis (Table 1):

 | GROUP 1 | GROUP 2
TARGET GROUP | Visually impaired users | Any user preferring hands-free (+ eyes-free) browsing
INPUT | A set of function keys (+ an equivalent set of spoken commands), no control of the mouse cursor, mouse clicks etc. | A set of spoken commands + dynamically generated commands for activating links on the current page
OUTPUT | Structured output sent to a Braille display or a speech synthesizer | Unaltered visual rendering enriched with so-called Saycons (+ speech synthesis)
PARSING OF WEB-CONTENT | "Deep" parsing of the web pages being browsed | No "deep" parsing of the web pages being browsed

Table 1: Major groups of "alternative browsers"

Examples of group 1 are Braillesurf (Hadjadj et al. 1999, Schwarz et al. 2005), BrookesTalk (Zajicek et al. 2000), Emacspeak, and Homer (Mihelic et al. 2002). Examples of group 2 are Conversa Voice Surfer (formerly Conversa Web) (Robin et al. 1998), HFB (HandsFree Browser by EduMedia) and add-ons shipped with certain versions of IBM's ViaVoice and Dragon NaturallySpeaking.

The Indtal-browser belongs to the latter group, though it has an explicit focus on end-users with mobility disabilities. Web content is generally highly visually oriented, and in our view it makes no sense to attempt to support eyes-free browsing unless the needs of visually impaired users are addressed explicitly. Combining hands-free and eyes-free facilities for web browsing hardly makes sense at all. Users with both mobility and visual handicaps are extremely few, and their most severe everyday problems do not concern access to the web.

Apart from the stricter focus on the end-users, the Indtal-browser has been implemented to meet the following five criteria:

Criterion 1: Minimizing Future Maintenance Requirements

Many alternative browsers developed during the last decade have been quietly withdrawn, leaving no trace except for broken links on referring sites like the WAI site hosted by W3C. One possible explanation for the apparently short lifetime of such systems is that they are developed in the framework of research projects or are (like Indtal) the result of once-and-for-all funding that leaves no resources for subsequent maintenance.

To minimize the requirements for future maintenance, Indtal has chosen a mainly message-driven approach. The system runs on Win32 systems, is dependent only on the Microsoft C-runtime library, and uses Microsoft Internet Explorer as its browser. The alternative possibility of building the system as a modification of open-source browsers like the GTK Web browser Dillo (www.dillo.org) or Mozilla (mozilla.org) was ruled out.

Criterion 2: Allowance for other Input Devices

The message-driven approach described above has the further advantage that speech control can coexist with other "third party devices" generating keyboard and mouse messages to the operating system and the browser engine.

Contact with potential end-users during the design and implementation phase has shown that many of them are, to some extent, capable of operating a standard PC with some additional equipment. Typical examples are short hand-mounted "sticks" for use with standard keyboards, head trackers and eye trackers that generate mouse messages and operate on-screen keyboards, and specialized "joysticks" tailored to an end-user who may be able to control the neck, a few fingers etc. These devices can coexist with the oral control of Indtal.
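The coexistence of speech with other input devices can be illustrated with a minimal sketch. It is not the Indtal implementation: the `InputMessage` type, the `MessageQueue` class, the spoken commands and the key names are all hypothetical and only show how several sources might post messages to one shared queue that the browser side consumes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class InputMessage:
    source: str     # device that produced the message (hypothetical labels)
    kind: str       # "key" or "mouse_move"
    payload: tuple  # key names, or a (dx, dy) cursor offset


# Hypothetical spoken commands and the key messages they stand for.
SPOKEN_COMMANDS = {
    "page down": InputMessage("speech", "key", ("PAGE_DOWN",)),
    "go back": InputMessage("speech", "key", ("ALT", "LEFT")),
}


class MessageQueue:
    """One queue shared by the speech front-end and any other device."""

    def __init__(self):
        self._queue = deque()

    def post(self, message):
        self._queue.append(message)

    def post_spoken(self, utterance):
        # Unknown utterances are ignored instead of raising an error.
        message = SPOKEN_COMMANDS.get(utterance)
        if message is None:
            return False
        self.post(message)
        return True

    def drain(self):
        # Deliver messages in arrival order, whatever their source.
        while self._queue:
            yield self._queue.popleft()


queue = MessageQueue()
queue.post_spoken("page down")
queue.post(InputMessage("head_tracker", "mouse_move", (10, -5)))
delivered = list(queue.drain())
```

Because the consumer only sees messages, it does not need to know whether a given message came from the recogniser or from a head tracker.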

Criterion 3: Support for "all" Web Standards

The Indtal browser aims at a non-normative approach to the format and structure of the web pages being browsed. Many alternative web browsers only (or mainly) support HTML, typically with further restrictions regarding the fulfilment of certain "recommendations" (e.g. the WAI recommendations of W3C, W3C 2005).

For the alternative browsers of group 1 (explicitly addressing visually impaired users) this restriction is unavoidable since they have to employ a deeper "understanding" of the web pages being browsed than group 2. For instance, when allowing visually impaired users to skim the content by outputting only headlines or links, the browser relies on the web content being well-formed and in compliance with the WAI recommendations or similar.

Alternative browsers of group 2 encounter similar problems only to the extent that they also attempt to incorporate some of the functionality specific to group 1. Otherwise the normative approach to the content being browsed must be considered an unnecessary limitation.

With one exception, the Indtal browser does not attempt to alter or "translate" the standard visual output rendering of web content. The exception is the visual enumeration of links (the HTML <a> elements), which at the user's request are displayed in the browser window to indicate how the links can be activated by spoken commands (e.g. "go to link number twenty-four"). Hence the parsing of web content in Indtal is minimal and extremely robust.
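The link enumeration can be sketched as follows. This is only an illustration using Python's standard `html.parser` module, not the parser used in Indtal; the names `LinkEnumerator` and `enumerate_links`, and the decision to count only <a> elements that carry an `href` attribute, are assumptions.

```python
from html.parser import HTMLParser


class LinkEnumerator(HTMLParser):
    """Collects the href of every <a> element in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Assumption: anchors without an href are not activatable links.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def enumerate_links(html):
    """Map spoken link numbers (starting at 1) to link targets."""
    parser = LinkEnumerator()
    parser.feed(html)
    parser.close()
    return dict(enumerate(parser.links, start=1))


page = '<p><a href="/sports">Sports</a> <a name="top">Top</a> <a href="/news">News</a></p>'
numbered = enumerate_links(page)
```

A command such as "go to link number two" would then be resolved by looking up key 2 in the resulting mapping. No other element of the page needs to be interpreted, which is what keeps this kind of parsing shallow.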

Further, to give users access to web content that is not plain HTML (e.g. to trigger mouse-over events such as pull-down menus implemented in ECMAScript), Indtal also deploys a voice-controlled mouse. The system depicts the mouse cursor as the center of a compass with rulers in eight directions (north, north-east, east etc.). Each ruler shows a point and a numbered value at every 100th pixel, helping the user move the cursor with commands like "go north-east two hundred and ten". The cursor can be positioned anywhere on the screen with just two commands, though users (including trained ones) usually need a few more!
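The arithmetic behind such a compass command can be sketched as below. The sketch rests on several assumptions that the description above does not settle: screen coordinates with y growing downwards, a distance measured in pixels along the ruler (so a diagonal move is split evenly between both axes), a 1024 x 768 screen, and clamping at the screen edges. The names `DIRECTIONS` and `move_cursor` are hypothetical.

```python
import math

# Unit steps per compass direction; y grows downwards on screen.
DIRECTIONS = {
    "north": (0, -1), "north-east": (1, -1),
    "east": (1, 0), "south-east": (1, 1),
    "south": (0, 1), "south-west": (-1, 1),
    "west": (-1, 0), "north-west": (-1, -1),
}


def move_cursor(position, direction, distance, screen=(1024, 768)):
    """Return the cursor position after moving `distance` pixels along a ruler."""
    dx, dy = DIRECTIONS[direction]
    # Assumption: on a diagonal the distance is measured along the ruler.
    scale = distance / math.sqrt(2) if dx and dy else distance
    x = position[0] + round(dx * scale)
    y = position[1] + round(dy * scale)
    # Keep the cursor inside the visible screen area.
    x = min(max(x, 0), screen[0] - 1)
    y = min(max(y, 0), screen[1] - 1)
    return (x, y)


# Two axis-aligned commands reach a target from the starting position.
step_one = move_cursor((100, 100), "east", 500)
step_two = move_cursor(step_one, "south", 400)
```

Here "go east five hundred" followed by "go south four hundred" moves the cursor from (100, 100) to (600, 500), which illustrates the two-command case.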

Criterion 4: Independence of the language on the web pages being browsed

Danish speakers constitute a small language community, also on the web! We assume that Danes are more likely to view web pages in non-native languages than e.g. English-speaking users. Hence, it would be perceived as a severe limitation if the Indtal-browser could only access web pages composed in Danish.

The alternative browsers belonging to group 1 (cf. Table 1) have to employ a deep "understanding" of the web pages being browsed, and language-dependent parsing techniques are often used. E.g. intelligent summarizing of (long) documents presupposes language-dependent techniques. Further, if the textual content of web pages is sent to a speech synthesizer, the language dependency is increased.

Alternative browsers belonging to group 2 need not employ techniques dependent on the language of the web page being browsed. Nevertheless, many of them are language-dependent, either because they implement some functionality otherwise specific to group 1 or because they support the so-called Saycons™ technology.

The Saycons™ technology implies that links can be activated by dynamically generated voice commands, e.g. that the standard subsections found in web versions of newspapers can be accessed by commands like go to "Sports", "Domestic News", "International News", etc. This increases the (apparent) naturalness of the application. However, the problems involved are: 1) Links (text within the <a>-element and the alt-value of pictures within the <a>-element) must be unique, easy to pronounce, and acoustically discriminative. 2) The links must be composed in the supported language (otherwise the automatic transcription to phonemes will not work).
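The uniqueness requirement in problem 1 can be made concrete with a small sketch. It only checks textual uniqueness after a simple normalisation (lower-casing and collapsing whitespace, both assumptions); pronounceability and acoustic discriminability cannot be checked this way. The function name `ambiguous_link_texts` is hypothetical.

```python
from collections import Counter


def ambiguous_link_texts(link_texts):
    """Return the normalised link texts that occur more than once on a page."""
    # Assumption: case and spacing differences are inaudible in a spoken command.
    normalised = [" ".join(text.lower().split()) for text in link_texts]
    counts = Counter(normalised)
    return sorted(text for text, count in counts.items() if count > 1)


texts = ["Sports", "Read more", "Domestic News", "read  more", "International News"]
clashes = ambiguous_link_texts(texts)
```

Any link text reported here could not be activated unambiguously by speaking it, whereas numbered links are unique by construction.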

Due to these problems, the Indtal-browser does not support the Saycons™ technology. The numbering of links described above is the only functionality allowing the user to activate a link by a single command. As a result, the lexicalized vocabulary used in Indtal is closed. This allows for training of whole word acoustic models that are more robust than flexible (vocabulary-independent) models modeling e.g. generalized triphones.

Criterion 5: Memorability and Learnability rather than Naturalness

The motivation for using speech recognition and speech synthesis technology in human-computer interaction is often given in terms of "naturalness" and similar:

Hugh Cameron (Cameron 2000) represents a much more sceptical view when it comes to the use of speech technology in HCI:

"When will people use speech to communicate with machines?

The Indtal-browser explicitly addresses users with mobility disabilities. Hence, the use of speech recognition technology is justified even by Cameron's far more critical criteria. However, the same criteria cannot justify the use of speech synthesis.

2. Front-end Speech Recogniser

The overall architecture of the Indtal-browser system is depicted in figure 1. The front-end consists of the speech recogniser, which communicates with the actual application (visible to the user only in the form of a bar-like widget). The application in turn sends the appropriate messages to the browser engine of the system (MS Internet Explorer). The HTML parser, which is part of the back-end application, enumerates links and can at the user's request force the browser to display the link numbers.

Fig 1: Overall Architecture

The speech recogniser of the system resides in a dynamic link library, is application-independent, and to some extent based on the SpeechDat(II) reference recogniser (Lindberg et al. 2000). It has been re-implemented to support real-audio input, hardware-detection and mixer-settings (recording level), endpoint-detection, language-selection based on the PC's locale, pronunciation dictionaries in the SAMPA format, multiple grammars that can be activated and de-activated, and an application-independent API using the established call-back or "listener" paradigm.

For use in the desktop environment with acoustic models trained on data recorded over the fixed-line telephone network (e.g. standard SpeechDat(II) models), a down-sampling filter simulating telephone bandwidth with standard COMBO characteristics can be applied.

The recogniser uses core modules from HVite of the HTK toolkit (Young et al. 1997): front-end feature processing (Mel-scaled cepstral coefficients), internal representation of acoustic models (currently generalized triphones with state tying) and grammar lattices. Further, the actual Viterbi decoding algorithm is based on HVite.
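For readers unfamiliar with Viterbi decoding, the following toy sketch shows the underlying dynamic-programming idea on a tiny discrete hidden Markov model. It is a textbook formulation in the log domain and is not meant to reflect how HVite is implemented; the function name and the dictionary-based model layout are assumptions made for the example.

```python
def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence and its log probability."""
    # Each column maps a state to (best log score so far, previous state).
    first = observations[0]
    columns = [{s: (log_start[s] + log_emit[s][first], None) for s in states}]
    for obs in observations[1:]:
        previous = columns[-1]
        column = {}
        for s in states:
            # Pick the predecessor that gives the best score for reaching s.
            best_prev = max(states, key=lambda p: previous[p][0] + log_trans[p][s])
            score = previous[best_prev][0] + log_trans[best_prev][s]
            column[s] = (score + log_emit[s][obs], best_prev)
        columns.append(column)
    # Backtrace from the best final state to the start.
    state = max(states, key=lambda s: columns[-1][s][0])
    log_prob = columns[-1][state][0]
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path, log_prob
```

In a recogniser the states would belong to acoustic models connected by a grammar lattice and the observations would be feature vectors scored by continuous densities, so a real decoder is considerably more involved than this sketch.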

The API has borrowed its basic abstractions from MS SAPI 4.0 and JSAPI, where the typical steps for an application to access a speech recogniser include:

  1. creating the recogniser for a specific language and submitting a callback-handle (that retrieves information about speech activity etc.),
  2. creating one or more "rule grammars" and adding a "listener" to each of them (the listener retrieves recognition results including the name of the grammar accepting input, an n-best list where each item consists of a sequence of "tokens" with time-information and score)
  3. enabling and disabling grammars, committing changes, resuming recognition etc.
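The three steps above can be sketched with a small mock. Nothing here is the actual Indtal API: `MockRecogniser`, `RuleGrammar`, their methods and the shape of the n-best list are hypothetical stand-ins, and the mock matches text strings instead of decoding audio.

```python
class RuleGrammar:
    """A named set of accepted phrases with attached result listeners."""

    def __init__(self, name, phrases):
        self.name = name
        self.phrases = set(phrases)
        self.enabled = False
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)


class MockRecogniser:
    """Stand-in for a recogniser; it matches text instead of decoding audio."""

    def __init__(self, language, activity_callback):
        self.language = language
        self.activity_callback = activity_callback
        self._grammars = {}
        self._committed = set()

    def create_grammar(self, name, phrases):
        grammar = RuleGrammar(name, phrases)
        self._grammars[name] = grammar
        return grammar

    def commit_changes(self):
        # Enable/disable requests only take effect once they are committed.
        self._committed = {n for n, g in self._grammars.items() if g.enabled}

    def simulate_utterance(self, text):
        self.activity_callback("speech detected")
        for name, grammar in self._grammars.items():
            if name in self._committed and text in grammar.phrases:
                # One-item n-best list; a real result would carry timing and scores.
                n_best = [{"tokens": text.split(), "score": 0.0}]
                for listener in grammar.listeners:
                    listener(name, n_best)


# Step 1: create the recogniser for a language, passing an activity callback.
activity = []
recogniser = MockRecogniser("en", activity.append)

# Step 2: create a rule grammar and attach a listener that stores results.
results = []
navigation = recogniser.create_grammar("navigation", ["go back", "page down"])
navigation.add_listener(lambda name, n_best: results.append((name, n_best[0]["tokens"])))

# Step 3: enable the grammar, commit the change, then recognise.
navigation.enabled = True
recogniser.commit_changes()
recogniser.simulate_utterance("go back")
```

Separating "enable" from "commit" lets an application switch several grammars at once without the recogniser ever seeing a half-updated set.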

The recognition module has been implemented entirely in C and C++. However, due to hardware detection, support for direct microphone input etc., it only compiles and runs on Win32 systems. The API has been implemented as an ANSI C application interface, which also allows access from other programming languages (e.g. the back-end application of Indtal is implemented in Delphi).

On top of the ANSI C API, a further API based on JNI (the Java Native Interface) has been implemented, and a number of interfaces specified for the standard extension package javax.speech.recognition have been implemented in Java. Thus the recogniser (with the Java extension called "JHvite"!) is to a large extent JSAPI compliant.

The application and language independence of the recogniser has proven its worth in a couple of student projects. A further demonstration of this independence, a simple multilingual voice-controlled calculator using the dynamic link library with the recogniser shipped with Indtal, can be downloaded from kom.aau.dk/~tb/indtal/. The calculator is controlled by speaking natural numbers (e.g. twenty five point five) and arithmetical operations (plus, minus, multiplied by, divided by). To control the calculator using e.g. English or German commands, one simply has to add a sub-directory with the corresponding locale name (en or de) and copy the standard SpeechDat(II) files to that location: 1) models (tied_32_2.mmf), 2) list file (tied.lis), 3) SAMPA pronunciation dictionary (lexicon.tbl), 4) mapping file if required (phone.map).
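As an illustration of this directory convention, the following sketch checks whether a locale sub-directory contains the files listed above. The file names are taken from the text; the function name `missing_language_files`, the idea of a common root directory, and the decision to treat phone.map as optional (since it is only needed "if required") are assumptions.

```python
from pathlib import Path

# File names as listed in the text; phone.map is only needed in some cases.
REQUIRED_FILES = ("tied_32_2.mmf", "tied.lis", "lexicon.tbl")
OPTIONAL_FILES = ("phone.map",)


def missing_language_files(root, locale):
    """Return the required files that are absent from the locale sub-directory."""
    directory = Path(root) / locale
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]
```

A call such as `missing_language_files("path/to/calculator", "de")` would return an empty list once the German files have been copied into place. The path in this call is a placeholder.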

Due to ownership-problems, Indtal can only provide users with Danish SpeechDat(II) files (located in the da-subdirectory).

Conclusion

The Indtal-browser has been evaluated by usability experts at Aalborg University based on interviews with test persons belonging to the target group (Jensen et al. 2005). The conclusion of this evaluation suggests various improvements to the system. It should come as no surprise that some users are not 100% happy about the accuracy of the speech recogniser. However, we know that mobility handicaps sometimes influence users' ability to articulate normally. A speech database recorded in the desktop environment, to be used for more robust whole-word models, has been established during the project, but the released version uses the standard SpeechDat(II) database for triphone modelling.

Ultimately, the success of the Indtal-browser must be measured based on still unanswered questions like:

For the Department of Communication Technology the Indtal-project has also been a welcome opportunity to establish some general Danish speech recognition resources that can be used in other projects and serve educational purposes.

Bibliography

Cameron, Hugh: "Speech at the interface". Proceedings of the COST249 Workshop on Speech in Telephone Networks, Ghent 2000.

Hadjadj, Djamel & Dominique Burger: "BrailleSurf: An HTML Browser for visually handicapped people". In Proc. of 14th conference on "Technology and Persons with Disabilities", Los Angeles 1999.

Jensen, Janne Jul, Lars Bo Larsen, Erik Aaskoven, Tom Brøndsted, Christian Gai Hjulmand, Børge Lindberg, Peter P. Pedersen: Bruger-evaluering af indtal.dk, Internal Report, Aalborg University April 2005

Lindberg, Børge & Finn Tore Johansen, Narada Warakagoda, Gunnar Lehtinen, Zdravko Kacic, Andrej Zgank, Kjell Elenius, Giampiero Salvi: A Noise Robust Multilingual Reference Recogniser based on SpeechDat(II). ICSLP 2000.

Mihelic, France & Nikola Pavesic, Simon Dobrisek, Jerneja Gros, Bostjan Vesnicer, Janez Zibert: Homer - A Small Self Voicing Web Browser for Blind People. Laboratory of Artificial Perception, Systems and Cybernetics Faculty of Electrical Engineering, University of Ljubljana, Slovenia, 2002

Robin, Michael B. & Charles T. Hemphill: Considerations in Producing a Commercial Voice Browser, W3C WS on "Voice Browsers". Massachusetts 1998

Schwarz, Emmanuel, Gaële Hénault, Dominique Burger: BrailleSurf 4, www.snv.jussieu.fr/inova/bs4/, visited March 2005

W3C: Web Accessibility Initiative (WAI), www.w3.org/WAI/ visited March 2005.

Young, S. & V. Valtchev and P. Woodland: The HTK book (for HTK Version 2.1) Entropic Cambridge Research Laboratory, Mar. 1997.

Zajicek, M. & I. Venetsanopoulis: Using Microsoft Active Accessibility in a Web Browser for the blind and visually impaired. Proc. of the Annual International Conference "Technology and Persons with Disabilities", Los Angeles 2000.


Tom Brøndsted
Department of Communication Technology, Aalborg University
Niels Jernes Vej 12, DK-9220 Aalborg Ø
Office: A6-321,
Tel. +45 96 35 86 36
Email tb@kom.aau.dk