Enhancing the user experience for a word processor 
application through vision and voice 
 
By  
 
Tanya René Beelders 
 
Submitted in fulfilment of the requirements for the degree  
 
 
PHILOSOPHIAE DOCTOR 
 
 
In the Faculty of Natural and Agricultural Sciences 
Department of Computer Science and Informatics 
University of the Free State 
Bloemfontein 
South Africa 
 
 
2011 
 
 
 
Promotor: 
Prof. P.J. Blignaut 
Department of Computer Science and Informatics 
 
 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
Two roads diverged in a wood and I - I took the one less travelled by, 
and that has made all the difference. 
~ Robert Frost ~ 
  
 
 
 
  
 
 
 
ACKNOWLEDGEMENTS 
I would like to express my utmost thanks and gratitude to the following: 
• Professor Pieter Blignaut, my promoter, for his guidance, assistance and patience throughout this 
undertaking. 
• The staff of the Computer Science and Informatics Department at the University of the Free State for 
their moral support and friendship. 
• My friends and family for their support, understanding and patience. 
  
 
 
 
  
 
 
 
PREFACE 
The study contained within this thesis has, to date, yielded a number of publications. Most recently, a 
submitted manuscript has been accepted for publication as a chapter in an upcoming book on speech 
technologies. The book is currently in press. The following is a list of articles which have been published from 
this work (the publications are reproduced in Appendix I). 
1. Beelders, T.R. and Blignaut, P.J. (2009). A multi-modal interface for a popular word processor. Die 
Suid-Afrikaanse Akademie vir Wetenskap en Kuns Studentesimposium 2009, Bloemfontein, South 
Africa. 
2. Beelders, T.R. and Blignaut, P.J. (2010). Using vision and voice to create a multimodal interface for 
Microsoft Word 2007. Proceedings of the Symposium on Eye-Tracking Research and Applications 
(ETRA), Austin, Texas, United States of America, 173-176. 
3. Beelders, T.R., Blignaut, P.J. and Greeff, F. (2010). Eye-tracking and speech recognition instead of a 
computer mouse. Die Suid-Afrikaanse Akademie vir Wetenskap en Kuns Studentesimposium 2010, 
Pretoria, South Africa. 
4. Beelders, T.R. and Blignaut, P.J. (2011). The Usability of Speech and Eye Gaze as a Multimodal 
Interface for a Word Processor. In I. Ipšić (Ed.), Speech Technologies (pp. 385-404). 
ISBN: 978-953-307-996-7. 
  
 
 
 
 
 
 
 
  
 
 
 
TABLE OF CONTENTS 
 
 
LIST OF TABLES _____________________________________________________________________ ix 
LIST OF FIGURES ___________________________________________________________________ xiii 
LIST OF CHARTS ___________________________________________________________________ xiv 
CHAPTER 1: INTRODUCTION ___________________________________________________________ 1 
1.1 Introduction _____________________________________________________________ 1 
1.2 Aim ____________________________________________________________________ 1 
1.3 Motivation _______________________________________________________________ 1 
1.4 Problem statement ________________________________________________________ 3 
1.5 Research questions ________________________________________________________ 3 
1.6 Scope ___________________________________________________________________ 4 
1.7 Limitations of the study ____________________________________________________ 4 
1.8 Methodology _____________________________________________________________ 5 
1.9 Outline of the thesis _______________________________________________________ 7 
1.10 Summary ________________________________________________________________ 7 
CHAPTER 2: THEORETICAL BACKGROUND __________________________________________________ 8 
2.1 Introduction _____________________________________________________________ 8 
2.2 Word processors __________________________________________________________ 8 
2.3 Usability and user experience ________________________________________________ 9 
2.4 User interfaces __________________________________________________________ 10 
2.4.1 Perceptual, attentive and non-command user interfaces ___________________ 11 
2.4.2 Brain-computer user interfaces ________________________________________ 12 
2.4.3 Multimodal user interfaces ____________________________________________ 12 
2.4.4 Interaction techniques _______________________________________________ 13 
2.5 Computer users __________________________________________________________ 13 
2.5.1 Types of users ______________________________________________________ 14 
2.5.2 Aged users _________________________________________________________ 14 
2.5.3 Disabled users ______________________________________________________ 15 
2.6 Human modalities ________________________________________________________ 16 
2.6.1 Human vocal system _________________________________________________ 16 
2.6.2 Human vision system ________________________________________________ 17 
2.6.2.1 Physiology of the eye _______________________________________________ 17 
2.6.2.2 Eye movements ____________________________________________________ 17 
2.6.3 Temporal relationship between eye gaze and speech ______________________ 18 
2.7 Speech recognition _______________________________________________________ 19 
2.7.1 How speech recognition works ________________________________________ 19 
2.7.2 Functions of speech recognition _______________________________________ 20 
2.7.3 Considerations and factors influencing speech recognition _________________ 21 
2.7.4 Speech-enhanced user interfaces _______________________________________ 22 
2.7.5 Speech-enhanced word processing _____________________________________ 24 
2.7.6 Using speech recognition to control the cursor ___________________________ 25 
2.8 Eye-tracking_____________________________________________________________ 27 
2.8.1 Hardware __________________________________________________________ 27 
2.8.2 Eye-tracking applications _____________________________________________ 28 
2.8.3 Activation mechanisms _______________________________________________ 29 
2.8.3.1 Dwell time ________________________________________________________ 29 
2.8.3.2 Blinking __________________________________________________________ 30 
2.8.3.3 Look-and-shoot ____________________________________________________ 31 
2.8.3.4 Gestures _________________________________________________________ 31 
2.8.3.5 Pupil size _________________________________________________________ 32 
2.8.4 Using eye gaze in user interfaces _______________________________________ 33 
2.8.4.1 Replacement of the cursor ___________________________________________ 33 
2.8.4.2 Target selection ____________________________________________________ 34 
2.8.4.2.1 Using an ISO standard to assess a pointing device _______________________ 34 
2.8.4.2.2 Increasing accuracy ______________________________________________ 35 
2.8.4.2.2.1 Expansion and magnification of targets __________________________ 36 
2.8.4.2.2.2 Zooming the entire display ____________________________________ 38 
2.8.4.2.2.3 Applicability to the current study _______________________________ 38 
2.8.5 Gaze-based user interfaces in practice __________________________________ 39 
2.8.5.1 Eye typing ________________________________________________________ 39 
2.8.5.2 Other applications of gaze-interaction __________________________________ 42 
2.8.6 Market trends of eye-tracking _________________________________________ 44 
2.9 Multimodal interfaces _____________________________________________________ 45 
2.9.1 Classification of multimodal interfaces __________________________________ 46 
2.9.2 Implementation of multimodal interfaces ________________________________ 46 
2.9.3 Eye gaze and speech multimodal interfaces ______________________________ 47 
2.9.3.1 Acquisition and spacing of targets _______________________________________ 48 
2.9.3.2 Applications _________________________________________________________ 49 
2.9.4 Text and data entry using eye gaze and speech ___________________________ 50 
2.10 Summary _______________________________________________________________ 52 
CHAPTER 3: EXPERIMENTAL DESIGN AND METHODOLOGY _____________________________________ 53 
3.1 Introduction ____________________________________________________________ 53 
3.2 Experimental design ______________________________________________________ 53 
3.3 Development of the application _____________________________________________ 53 
3.3.1 Motivation _________________________________________________________ 53 
3.3.2 Hardware __________________________________________________________ 54 
3.3.3 Development tools __________________________________________________ 54 
3.3.4 Interaction techniques _______________________________________________ 55 
3.3.5 Technical specifications ______________________________________________ 59 
3.3.6 Resulting multimodal interface ________________________________________ 63 
3.4 Resolving the empirical research questions ____________________________________ 64 
3.4.1 Feasibility study ____________________________________________________ 64 
3.4.2 Pointing and clicking _________________________________________________ 64 
3.4.2.1 Assessment of a pointing device _______________________________________ 64 
3.4.2.2 Experimental design ________________________________________________ 68 
3.4.3 Word processor functions and text entry ________________________________ 70 
3.4.3.1 Assessment of word processor functions ________________________________ 70 
3.4.3.2 Assessment of text entry ____________________________________________ 71 
3.4.3.3 Experimental design ________________________________________________ 72 
3.5 Statistical analysis ________________________________________________________ 75 
3.6 Summary _______________________________________________________________ 76 
CHAPTER 4: FEASIBILITY TESTING OF THE MULTIMODAL INTERFACE _______________________________ 77 
4.1 Introduction ____________________________________________________________ 77 
4.2 Participants _____________________________________________________________ 77 
4.3 Tasks __________________________________________________________________ 77 
4.4 Limitations______________________________________________________________ 78 
4.5 Results _________________________________________________________________ 78 
4.6 Conclusion ______________________________________________________________ 80 
CHAPTER 5: ANALYSIS OF EYE GAZE AND SPEECH TO SIMULATE A POINTING DEVICE ___________________ 81 
5.1 Introduction ____________________________________________________________ 81 
5.2 Participants _____________________________________________________________ 81 
5.3 Trials __________________________________________________________________ 82 
5.4 Sessions ________________________________________________________________ 82 
5.5 Device movement ________________________________________________________ 83 
5.6 Analysis of the throughput _________________________________________________ 85 
5.6.1 Combining the interaction techniques ____________________________________ 85 
5.6.2 Analysing throughput _________________________________________________ 88 
5.7 Analysis of the time _______________________________________________________ 90 
5.7.1 Combining the interaction techniques ____________________________________ 90 
5.7.2 Analysing Time ______________________________________________________ 91 
5.8 Analysis of other measurements ____________________________________________ 93 
5.8.1 Target re-entries _____________________________________________________ 93 
5.8.1.1 Combining the interaction techniques __________________________________ 93 
5.8.1.2 Analysis of target re-entries __________________________________________ 93 
5.8.2 Incorrect target acquisitions ____________________________________________ 96 
5.8.2.1 Combining the interaction techniques __________________________________ 96 
5.8.2.2 Analysis of incorrect target acquisitions _________________________________ 96 
5.8.3 Incorrect clicks ______________________________________________________ 99 
5.8.3.1 Combining the interaction techniques __________________________________ 99 
5.8.3.2 Analysis of incorrect clicks ___________________________________________ 99 
5.8.4 Time to selection ____________________________________________________ 102 
5.8.4.1 Consolidating the interaction techniques _______________________________ 102 
5.8.4.2 Analysis of time to selection _________________________________________ 103 
5.8.4.3 Further analysis of selection times ____________________________________ 104 
5.9 Subjective device assessment ______________________________________________ 105 
5.10 Summary of findings _____________________________________________________ 106 
5.11 Further research ________________________________________________________ 109 
5.12 Summary ______________________________________________________________ 109 
CHAPTER 6: ANALYSIS OF SPEECH COMMANDS IN WORD _____________________________________ 110 
6.1 Introduction ___________________________________________________________ 110 
6.2 Procedure _____________________________________________________________ 110 
6.3 Participants ____________________________________________________________ 111 
6.4 Tasks _________________________________________________________________ 111 
6.5 Measurements _________________________________________________________ 112 
6.6 Limitations of this study __________________________________________________ 113 
6.7 Task analysis ___________________________________________________________ 113 
6.7.1 Line selection and formatting __________________________________________ 113 
6.7.1.1 Time to complete task _____________________________________________ 113 
6.7.1.2 Number of actions ________________________________________________ 116 
6.7.1.3 Correctness of task completion ______________________________________ 119 
6.7.2 Select all text and remove_____________________________________________ 120 
6.7.2.1 Time to complete task _____________________________________________ 120 
6.7.2.2 Number of actions ________________________________________________ 122 
6.7.2.3 Correctness of task completion ______________________________________ 124 
6.7.3 Select words and format ______________________________________________ 125 
6.7.3.1 Time to complete the task __________________________________________ 125 
6.7.3.2 Number of actions ________________________________________________ 127 
6.7.3.3 Average time between actions _______________________________________ 129 
6.7.3.4 Correctness of task completion ______________________________________ 131 
6.7.4 Paste _____________________________________________________________ 132 
6.7.4.1 Time to complete the task __________________________________________ 132 
6.7.4.2 Number of actions ________________________________________________ 134 
6.7.4.3 Correctness of task completion ______________________________________ 136 
6.7.5 Undo _____________________________________________________________ 137 
6.7.5.1 Time to complete _________________________________________________ 137 
6.7.5.2 Number of actions ________________________________________________ 139 
6.7.5.3 Correctness of task completion ______________________________________ 140 
6.7.6 Select word and copy ________________________________________________ 141 
6.7.6.1 Time to complete task _____________________________________________ 141 
6.7.6.2 Number of actions ________________________________________________ 143 
6.7.6.3 Correctness of task completion ______________________________________ 145 
6.7.8 Position and Paste ___________________________________________________ 146 
6.7.8.1 Time to complete the task __________________________________________ 146 
6.7.8.2 Number of actions ________________________________________________ 148 
6.7.8.3 Correctness of task completion ______________________________________ 150 
6.7.9 Select all and format _________________________________________________ 150 
6.7.9.1 Time to complete task _____________________________________________ 151 
6.7.9.2 Number of actions ________________________________________________ 152 
6.7.9.3 Correctness of task completion ______________________________________ 153 
6.8 Summary of results ______________________________________________________ 153 
6.9 Further research ________________________________________________________ 155 
6.10 Summary ______________________________________________________________ 156 
CHAPTER 7: ANALYSIS OF TYPING TASKS _________________________________________________ 157 
7.1 Introduction ___________________________________________________________ 157 
7.2 Participants ____________________________________________________________ 157 
7.3 Tasks _________________________________________________________________ 157 
7.4 Measurements _________________________________________________________ 158 
7.5 Analysis _______________________________________________________________ 159 
7.5.1 Analysis of keyboard and large buttons __________________________________ 159 
7.5.1.1 Error rate ________________________________________________________ 159 
7.5.1.2 Breakdown of error rates ___________________________________________ 162 
7.5.1.2.1 Insertion error percentage _______________________________________ 163 
7.5.1.2.2 Substitution error percentage ____________________________________ 165 
7.5.1.2.3 Deletion error percentage _______________________________________ 167 
7.5.1.3 Characters per second _____________________________________________ 169 
7.5.2 Analysis of all typing tasks_____________________________________________ 171 
7.5.2.1 Error Rate _______________________________________________________ 171 
7.5.2.2 Breakdown of error rate ____________________________________________ 173 
7.5.2.2.1 Percentage of insertion errors ____________________________________ 174 
7.5.2.2.2 Percentage of substitution errors __________________________________ 175 
7.5.2.2.3 Deletion errors percentage _______________________________________ 177 
7.5.2.3 Characters per second _____________________________________________ 179 
7.5.3 Summary of results __________________________________________________ 180 
7.6 Further research ________________________________________________________ 181 
7.7 Summary ______________________________________________________________ 181 
CHAPTER 8: PARTICIPANT SUBJECTIVE SATISFACTION ________________________________________ 183 
8.1 Introduction ___________________________________________________________ 183 
8.2 Procedure _____________________________________________________________ 183 
8.3 Reaction to the application ________________________________________________ 184 
8.3.1 Satisfaction ________________________________________________________ 184 
8.3.2 Learnability ________________________________________________________ 186 
8.4 Typing ________________________________________________________________ 187 
8.4.1 Satisfaction ________________________________________________________ 187 
8.4.2 Learnability ________________________________________________________ 189 
8.4.3 Preference and ease of use for typing settings ____________________________ 190 
8.5 Commands ____________________________________________________________ 192 
8.5.1 Satisfaction ________________________________________________________ 192 
8.5.2 Learnability ________________________________________________________ 193 
8.5.3 Types of commands _________________________________________________ 194 
8.6 Additional considerations _________________________________________________ 195 
8.7 Pointing device _________________________________________________________ 197 
8.8 Anecdotal observations __________________________________________________ 197 
8.10 Summary ______________________________________________________________ 199 
CHAPTER 9: CONCLUSION ___________________________________________________________ 200 
9.1 Introduction ___________________________________________________________ 200 
9.2 Motivation _____________________________________________________________ 200 
9.3 Aim __________________________________________________________________ 200 
9.4 Results ________________________________________________________________ 200 
9.4.1 Multimodal word processor ___________________________________________ 201 
9.4.2 Feasibility study _____________________________________________________ 201 
9.4.3 User testing ________________________________________________________ 202 
9.4.3.1 Usability of eye gaze and speech as a pointing technique __________________ 202 
9.4.3.2 Usability of speech commands _______________________________________ 203 
9.4.3.3 Usability for text entry _____________________________________________ 204 
9.4.3.4 Satisfaction ______________________________________________________ 204 
9.5 Recommendations ______________________________________________________ 205 
9.6 Implications for the future ________________________________________________ 206 
9.7 Further research ________________________________________________________ 206 
9.8 Summary ______________________________________________________________ 207 
REFERENCES _____________________________________________________________________ 208 
BIBLIOGRAPHY ___________________________________________________________________ 225 
APPENDIX A _____________________________________________________________________ 228 
APPENDIX B _____________________________________________________________________ 229 
APPENDIX C _____________________________________________________________________ 230 
APPENDIX D ____________________________________________________________________ 232 
APPENDIX E _____________________________________________________________________ 234 
APPENDIX F _____________________________________________________________________ 236 
APPENDIX G ____________________________________________________________________ 238 
APPENDIX H ____________________________________________________________________ 241 
APPENDIX I _____________________________________________________________________ 248 
PUBLICATIONS ___________________________________________________________________ 248 
SUMMARY ______________________________________________________________________ 270 
OPSOMMING ____________________________________________________________________ 271 
 
  
 
 
LIST OF TABLES 
 
Table 3.1: Verbal commands 58 
Table 3.2: Multimodal Add-Ins tab functions 60 
Table 3.3: Matrix of test conditions for ISO testing 69 
Table 3.4: Multi-directional tapping trials 69 
Table 3.5: Word processor functions and text entry testing task list 72 
Table 3.6: Descriptive statistics for phrase set 74 
Table 3.7: Frequencies with which letters occur in selected phrase set 74 
Table 3.8: Most frequently occurring words in selected phrase set 74 
Table 5.1: Grouped interaction techniques 86 
Table 5.2: Average throughput for all interaction techniques prior to consolidation 86 
Table 5.3: Results of normality tests for ETS(F) and ETS(I) throughput 87 
Table 5.4: Results of normality tests for ETSG(F) and ETSG(I) 87 
Table 5.5: Average throughput for the consolidated interaction techniques for all sessions 88 
Table 5.6: Results of the normality tests conducted on the throughput of all interaction techniques 89 
Table 5.7: Results of separate ANOVA on throughput for consolidated interaction techniques 89 
Table 5.8: Results of separate ANOVA on throughput for sessions 89 
Table 5.9: Average times for consolidated interaction techniques 91 
Table 5.10: Results of normality tests on time for consolidated interaction techniques 92 
Table 5.11: Descriptive statistics for the number of target re-entries 94 
Table 5.12: Average target re-entries for consolidated interaction techniques 94 
Table 5.13: Complete repeated-measures analysis results for consolidated interaction techniques 95 
Table 5.14: Descriptive statistics for the number of incorrect target acquisitions 97 
Table 5.15: Average incorrect target acquisitions for consolidated interaction techniques 97 
Table 5.16: Results of ANOVA on incorrect target acquisitions for consolidated interaction techniques 98 
Table 5.17: Descriptive statistics for the number of incorrect clicks 100 
Table 5.18: Average number of incorrect clicks for consolidated interaction techniques 100 
Table 5.19: Results of separate ANOVA on incorrect clicks for consolidated interaction techniques 101 
Table 5.20: Descriptive statistics for time to selection 102 
Table 5.21: Average time to selection 103 
Table 5.22: ANOVA results of time to selection 103 
Table 5.23: Descriptive statistics for final acquisition times 104 
Table 5.24: Separate ANOVA results for final target acquisition 105 
Table 5.25: Results of the device assessment questionnaire 106 
Table 6.1: Task description and grouping 112 
Table 6.2: Grouped tasks as divided between interaction techniques 112 
Table 6.3: Descriptive statistics for time to complete line selection and formatting 114 
Table 6.4: Normality test results from completion time of line selection and formatting 115 
Table 6.5: ANOVA results for the completion time of line selection and formatting 116 
Table 6.6: Descriptive statistics for the number of actions used for line selection and formatting 117 
Table 6.7: Results of ANOVA on the number of actions required to perform line selection and formatting 118 
Table 6.8: Descriptive statistics for completion time of removing all selected text 121 
Table 6.9: Descriptive statistics for the number of actions required to remove all selected text 123 
Table 6.10: Analysis results for the number of actions required to remove all selected text 124 
Table 6.11: Descriptive statistics for the completion time of formatting selected words 126 
Table 6.12: Analysis results for the completion times of formatting selected text 127 
Table 6.13: Descriptive statistics for the number of actions required to format selected words 128 
Table 6.14: Analysis results for the number of actions required to format selected words 129 
Table 6.15: Descriptive statistics for the time difference between actions 130 
Table 6.16: Analysis results for the time difference between actions 131 
Table 6.17: Descriptive statistics for paste time completion 133 
Table 6.18: Descriptive statistics for the number of actions to complete a paste 135 
Table 6.19: Analysis results for the number of actions to complete the paste task 136 
Table 6.20: Descriptive statistics for task completion time for the undo task 137 
Table 6.21: Analysis results for the completion time of the undo task 138 
Table 6.22: Descriptive statistics for the number of actions to complete the undo task 139 
Table 6.23: Analysis results for the number of actions to complete the undo task 140 
Table 6.24: Descriptive statistics for the completion time for selecting and copying a word 141 
Table 6.25: Descriptive statistics for the number of actions to select and copy text 143 
Table 6.26: Analysis results for the number of actions required to select and copy text 144 
Table 6.27: Descriptive statistics for completion time to position cursor and paste text 146 
Table 6.28: Analysis results for completion time to position cursor and paste text 147 
Table 6.29: Descriptive statistics for the number of actions to position the cursor and paste text 148 
Table 6.30: Descriptive statistics for the completion time to select and format all text 151 
Table 6.31: Descriptive statistics for the number of actions to select and format all text 152 
Table 6.32: Summary of significant results 154 
Table 7.1: Descriptive statistics for keyboard and speech-L error rate 160 
Table 7.2: Results of error rate analysis for keyboard and speech-L 161 
Table 7.3: Descriptive statistics for insertion errors of keyboard and speech-L 164 
Table 7.4: Analysis results for insertion error percentage of keyboard and speech-L 165 
Table 7.5: Descriptive statistics for substitution error percentage of keyboard and speech-L 166 
Table 7.6: Results for the analysis of session for speech-L substitution errors percentage 167 
Table 7.7: Descriptive statistics for the deletion error percentage of keyboard and speech-L 168 
Table 7.8: Analysis results for deletion error percentage of keyboard and speech-L 169 
Table 7.9: Descriptive statistics for characters per second of keyboard and speech-L 169 
Table 7.10: Analysis results for characters per second of keyboard and speech-L 170 
Table 7.11: Descriptive statistics for error rates of all interaction techniques 171 
Table 7.12: Analysis results of error rates for all interaction techniques 172 
Table 7.13: Descriptive statistics for insertion errors percentage of all interaction techniques 174 
Table 7.14: Analysis results for insertion errors percentage of all interaction techniques 175 
Table 7.15: Descriptive statistics for substitution errors percentage of all interaction techniques 176 
Table 7.16: Analysis results of substitution errors percentage for all interaction techniques 177 
Table 7.17: Descriptive statistics of deletion errors percentage for all interaction techniques 177 
Table 7.18: Analysis results of deletion errors percentage for all sessions 178 
Table 7.19: Descriptive statistics of characters per second for all interaction techniques 179 
Table 7.20: Analysis results of characters per second for all interaction techniques 180 
Table 8.1: Example contingency table for overall satisfaction 184 
Table 8.2: Descriptive statistics for each satisfaction question for the application 185 
Table 8.3: Descriptive statistics for overall satisfaction with application 185 
Table 8.4: Example contingency table for overall learnability 186 
Table 8.5: Descriptive statistics for learnability questions for the application 186 
Table 8.6: Descriptive statistics for overall learnability of the application 187 
Table 8.7: Example contingency table for Chi-square test 187 
Table 8.8: Descriptive statistics for satisfaction questions for the typing feature 188 
Table 8.9: Descriptive statistics for learnability questions for the typing feature 189 
Table 8.10: Contingency table for keyboard setup preference 191 
Table 8.11: Example of contingency table for satisfaction with speech commands 192 
Table 8.12: Descriptive statistics for satisfaction questions for the command feature 192 
Table 8.13: Descriptive statistics for learnability questions for the command feature 194 
Table 8.14: Contingency table for satisfaction with moving the cursor 194 
Table 8.15: Descriptive statistics for satisfaction of command types 194 
Table 8.16: Analysis results for satisfaction of additional considerations 196 
Table 8.17: Example of a contingency table for device assessment questions 197 
Table 8.18: Descriptive statistics for device assessment questionnaire responses 198 
Table 9.1: Summary of results for speech commands 203 
  
 
 
LIST OF FIGURES 
 
Figure 2.1: Cross-section view of human vocal system 16 
Figure 2.2: Physiology of the eye 17 
Figure 2.3: Video-based eye-tracking using the reflection of an infrared light source and the centre of the pupil 
to calculate the direction of the eye gaze 28 
Figure 2.4: EyeCon animation of eye closing 29 
Figure 2.5: EyeWrite being used with Microsoft Notepad 32 
Figure 2.6: Invisible expansion of targets 36 
Figure 2.7: EagleEyes application in use 43 
Figure 2.8: Matrix with ROI squares each outlined in a different colour 49 
Figure 3.1: Calibration process in Microsoft Word 55 
Figure 3.2: Onscreen QWERTY keyboard 56 
Figure 3.3: Magnification of the onscreen keyboard 56 
Figure 3.4: (a) Centred and (b) off-centre gaze position indicator 57 
Figure 3.5: (a) Hollow circle and (b) square used as gaze indicators 57 
Figure 3.6: Visual feedback on a selectable target through (a) framing and (b) inverting colours 57 
Figure 3.7: Multimodal Add-Ins tab in Microsoft Word 59 
Figure 3.8: Class diagram of developed application 62 
Figure 3.9: Multi-directional tapping test using ISO9241-9 66 
Figure 3.10: Multi-directional tapping task using eye gaze and speech with target button currently  
having focus 70 
Figure 5.1(a): Mouse path and (b) Eye-tracking (without gravitational well) path of a single participant 83 
Figure 5.1(c): Eye-tracking (with gravitational well) path and (d) Eye-tracking, with magnification,  
path of a single participant 84 
Figure 5.2(a): Mouse path and (b) Eye-tracking (without gravitational well) path of a single participant 84 
Figure 5.2(c): Eye-tracking (with gravitational well) path and (d) Eye-tracking, with magnification,  
path of a single participant 84 
  
xiii 
 
 
LIST OF CHARTS 
 
Chart 4.1: Responses to questionnaire
Chart 5.1: Average throughput for all interaction techniques prior to consolidation
Chart 5.2: Average throughput for consolidated interaction techniques over all sessions
Chart 5.3: Average times for consolidated interaction techniques
Chart 5.4: Average target re-entries for consolidated interaction techniques
Chart 5.5: Average incorrect target acquisitions for consolidated interaction techniques
Chart 5.6: Average number of incorrect clicks for consolidated interaction techniques
Chart 5.7: Average time to selection
Chart 5.8: Average time to final selection for M and ETSG
Chart 6.1: Means for completion time of line selection and formatting
Chart 6.2: Mean number of actions required to perform line selection and formatting
Chart 6.3: Correctness of task - Select lines and format
Chart 6.4: Mean plot for completion time of removing all selected text
Chart 6.5: Mean plot for the number of actions required to remove all selected text
Chart 6.6: Correctness of task - Select all text and remove
Chart 6.7: Mean plot for completion times of formatting selected words
Chart 6.8: Mean plot for the number of actions required to format selected words
Chart 6.9: Mean plot for the time difference between actions
Chart 6.10: Correctness of task - Select words and apply formatting
Chart 6.11: Mean plot for the paste time completion
Chart 6.12: Mean plot for the number of actions to complete the paste
Chart 6.13: Mean plot for the completion time of the undo task
Chart 6.14: Mean number of actions to complete the undo task
Chart 6.15: Mean plot for the completion time for selecting and copying a word
Chart 6.16: Mean for the number of actions to select and copy text
Chart 6.17: Correctness of task completion - Select word and copy
Chart 6.18: Mean plot for completion time to position cursor and paste text
Chart 6.19: Mean number of actions to position the cursor and paste text
Chart 6.20: Correctness of task completion - Position and paste
Chart 6.21: Means for the completion time to select and format all text
Chart 6.22: Mean number of actions to select and format all text
Chart 7.1: Mean error rate of keyboard and speech-L
Chart 7.2: Error-free transcribed text for keyboard and speech-L
Chart 7.3: Breakdown of first and last task's error rates for keyboard and speech-L
Chart 7.4: Mean insertion error percentage of keyboard and speech-L
Chart 7.5: Mean substitution error percentage of keyboard and speech-L
Chart 7.6: Mean deletion errors percentage of keyboard and speech-L
Chart 7.7: Mean characters per second of keyboard and speech-L
Chart 7.8: Mean error rate for all interaction techniques
Chart 7.9: Error-free transcribed text for all interaction techniques
Chart 7.10: Breakdown of first task and last task’s error rate for all interaction techniques
Chart 7.11: Mean insertion errors percentage for all interaction techniques
Chart 7.12: Mean substitution errors percentage of all interaction techniques
Chart 7.13: Mean deletion errors percentage for all interaction techniques
Chart 7.14: Mean characters per second for all interaction techniques
Chart 8.1: Number of responses in each category of the typing feature satisfaction questions
Chart 8.2: Number of responses in each category of the typing feature learnability questions
Chart 8.3: Preference ranking of the onscreen keyboard setups
Chart 8.4: Ease of use ranking for the onscreen keyboard settings
Chart 8.5: Number of responses in each category for satisfaction questions for command feature
Chart 8.6: Number of responses in each satisfaction category for command types
Chart 8.7: Number of responses in each category for additional considerations of using eye gaze and speech
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER 1 
INTRODUCTION 
 
1.1 Introduction 
A word processor is a software application which allows for composition, editing and formatting of a printable 
document (wordiQ, 2010). The word processor has become a very popular tool in the everyday use of a 
computer and has displayed a remarkable ability to evolve and incorporate emerging technologies. The 
original word processor was developed by IBM in 1969 (Eisenberg, 1992) and since then it has evolved 
constantly, exploiting the advances in technology. 
As an integral part of everyday life for many people, a word processor should cater for a very diverse group of
users, and it offers a unique environment which is rich in potential for improvement of the user experience.
However, it is highly unlikely that a single application of such complexity could offer the best possible
experience to all users. The word processor and the improvement of its usability are the main focus areas of
this research study.
 
1.2 Aim 
The aim of the study is to investigate various means to increase the usability of a word processor for use by a 
diverse group of users, including users of different expertise levels, ages and abilities. Specifically, the aim is to
determine (i) whether it is feasible¹ to incorporate a truly multimodal interface into a popular existing word
processor application through the use of non-traditional input methods and (ii) how usable such an interface
will be².
 
1.3 Motivation 
Communication between humans and computers is considered to be two-way communication between two 
powerful processors over a narrow bandwidth (Jacob & Karn, 2003). Most interfaces today utilise more 
bandwidth with computer-to-user communication than vice versa, leading to a decidedly one-sided use of the 
available bandwidth (Jacob & Karn, 2003). An additional communication mode will invariably provide for an 
improved interface (Jacob, 1993a) and new input devices which capture data from the user both conveniently 
and at a high speed are well suited to provide more balance in the bandwidth disparity (Jacob & Karn, 2003). In 
order to better utilise the bandwidth between human and computer, more natural communication which 
concentrates on parallel rather than sequential communication, is required (Jacob, 1993a). The eye-tracker is 
one possibility which meets the criteria for such an input device. Eye-trackers have steadily become more
robust, more reliable and cheaper, and therefore present themselves as a suitable tool for this use (Jacob & Karn,
2003). However, much research is still needed to determine the most convenient and suitable means of 
interaction before the eye-tracker can be fully incorporated as a meaningful input device (Jacob & Karn, 2003).  
                                                           
¹ A feasibility test is aimed at determining whether the proposed interface is viable and whether it could offer
a potentially usable interface to any users. Therefore, contrary to a more formal usability study, it does not
require that objective measurements be captured and analysed statistically.
² This aim will require more formal usability measurements to be captured and analysed.
 
Chapter 1 Introduction 
Furthermore, the user interface is the conduit between the user and the computer and as such plays a vital 
role in the success or failure of an application. Modern-day interfaces are entirely graphical and require users
to visually acquire and manually manipulate objects on screen (Hatfield & Jenkins, 1997). The current style of
Windows, Icons, Menus and Pointer (WIMP) interface has been around since the 1970s (Van Dam, 2001).
These graphical user interfaces may pose difficulties to users with disabilities and it has become essential that 
viable alternatives to mouse and keyboard input should be found (Hatfield & Jenkins, 1997).  Specially 
designed applications which take users with disabilities into consideration are available but these do not 
necessarily compare with the more popular applications. Disabled users should be accommodated in the same 
software applications as any other computer user, which will naturally necessitate new input devices (Istance, 
Spinner & Howarth, 1996) or the redevelopment of the user interface. Eye movement is well-suited to these 
needs as the majority of motor impaired individuals still retain oculomotor abilities (Istance et al., 1996). 
However, in order to disambiguate user intention and interaction, eye movement may have to be combined 
with another means of interaction such as speech. This study aims to investigate various ways to provide 
alternative means of input which could facilitate use of the mainstream product by disabled users. 
These alternative means should also enhance the user experience for novice, intermediate and expert users. 
Previous studies (Beelders, 2009; Blignaut, Dednam & Beelders, 2007) show that novice users of word 
processors experience a number of obstacles in acceptance and usage of a word processor that are unique to 
their particular demographic. Alternative pictorial icons, text buttons and translation of the interface into the 
native language of the user all failed to lessen the learning curve or to increase usability significantly. However, 
these findings should not discourage researchers but should serve as encouragement to find more innovative 
and creative means of alleviating the burden on these users. Particularly, since these users show remarkable 
eagerness and enthusiasm to learn, greater effort should be made to accommodate them to become 
mainstream users. Although the main focus could be to narrow the gap between novice and expert users, the 
means to achieve this should not alienate or disrupt the smooth flow of work that an expert user is capable of 
achieving. This study therefore proposes to be an extension or continuation of these aforementioned studies, 
and to investigate further ways to improve the interface of a word processor for all user groups. Eye-tracking, 
which was identified as a possible means of interaction to increase bandwidth use and meet the needs of 
disabled users, also provides a possible means of achieving this for these users. 
The technologies chosen to improve the usability of the word processor are speech recognition and eye-
tracking. As it is, Microsoft Office already comes bundled with an in-built speech engine which makes speech 
recognition available in all Office packages. Speech recognition offers an interaction means capable of 
replacing conventional typing and alleviating strain which may be caused by using an onscreen keyboard. Eye-
trackers may eventually become affordable enough to be a standard feature in future computing devices 
(Isokoski, 2000).  As it is, fairly inexpensive eye-tracking solutions have successfully been developed and used 
within gaze-based solutions (cf. Corno, Farinetti & Signorile, 2002; Haro, Essa & Flickner, 2000). 
However, given that the hardware and software are available, the task remains to prove that the eye-tracker
improves the quality of human-computer interaction as validation for the inclusion in future devices (Isokoski, 
2000). The underlying foundation of this research undertaking is the view that while eye gaze and speech 
recognition may be prone to ambiguity when used in isolation, using them in combination may allow many of 
the problems to be overcome. User intent can be inferred by providing a means for the user to gaze at certain 
objects and then issue verbal commands which can then be executed to create a hands-free application 
(Hatfield & Jenkins, 1997). In this way it is envisaged that the strengths of one interaction technique will be 
able to compensate for the weaknesses of the other and together speech and vision should provide a better 
interaction experience than each in isolation.  Given the inherent problems associated with target selection via 
eye gaze, such as accuracy, stability and the Midas touch problem (Chapter 2), it seems plausible that an 
additional modality might make selection easier and more feasible. Additionally, the actions required within a 
word processor can all be facilitated through the combined use of eye gaze and speech as interaction 
techniques (He & Kaufman, 1993). 
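The principle of inferring intent from a gazed-at object plus a verbal command can be illustrated with a minimal Python sketch. It is not the application developed in this study: the class Target, the functions resolve_target and handle_speech, the rectangular hit areas and the single spoken command "select" are all hypothetical names and assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Target:
    """A selectable on-screen object with a rectangular hit area (hypothetical)."""
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, gx: float, gy: float) -> bool:
        """True if the gaze point (gx, gy) falls inside this target."""
        return (self.x <= gx <= self.x + self.width
                and self.y <= gy <= self.y + self.height)


def resolve_target(targets: Sequence[Target], gx: float, gy: float) -> Optional[Target]:
    """Return the first target under the gaze point, or None if there is none."""
    for target in targets:
        if target.contains(gx, gy):
            return target
    return None


def handle_speech(command: str, targets: Sequence[Target],
                  gx: float, gy: float) -> Optional[str]:
    """Trigger an action only when a spoken command coincides with a gazed-at target.

    Gaze alone never triggers anything, which avoids unintended selections,
    and a command spoken while no target is being looked at is ignored.
    """
    target = resolve_target(targets, gx, gy)
    if target is None or command != "select":  # "select" is an assumed command word
        return None
    return f"click:{target.name}"
```

In this sketch neither modality can act alone: the gaze supplies the object and the speech supplies the confirmation, which is the compensating relationship described above.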
The goal of this study is therefore to determine whether the combination of eye gaze and speech can 
effectively be used as an interaction technique to replace the use of the traditional mouse and keyboard. 
 
1.4 Problem statement  
The research problem of the study is twofold: firstly to determine whether a multimodal interface using eye 
gaze and speech as interaction techniques is possible and feasible for a word processor; and secondly, as a 
feasible application does not necessarily imply a usable application, to establish the usability of such an 
application by comparing it to standard or traditional interaction techniques currently in use in a word 
processor. 
 
1.5  Research questions 
The research study will be conducted in a series of linear phases, each of which will have its own research 
question. The underlying proposal of the study is to determine whether the combination of eye gaze and 
speech as an interaction technique is a viable solution for a multimodal interface for a word processor. 
Therefore, it will first have to be established whether an existing word processor can be changed or emulated 
to incorporate a multimodal interface. Once this has been achieved, feasibility of this multimodal interface will 
have to be established.  
Following this, the usability of the multimodal interface will have to be tested through extensive user testing. 
For this purpose, three main features which an interaction technique must facilitate within a word processor 
were identified. The user must be able:  
1. to type text into the document; 
2. to use the interaction technique as a pointing device in order to click on icons within the ribbon and 
menu of the application;  
3. to achieve common word processing tasks such as formatting, document manipulation and navigation 
through a document without having to click on an icon or menu option. 
 
There will therefore be three primary research questions in this study, namely: 
1. Can a customisable multimodal interface be developed and successfully incorporated into a 
mainstream word processor with the aim of providing an all-inclusive application to a diverse group of 
users? 
2. How feasible is such an interface and in which context is it feasible? 
3. How usable is the multimodal interface compared to the traditional interaction techniques? 
Based on the identification of the word processing features above, research question 3 could be further 
subdivided into the following secondary questions: 
a. How usable is the combination of eye gaze and speech when used to simulate a pointing device? 
b. How usable are speech commands for performing common word processing tasks? 
c. How usable is the combination of eye gaze and speech when used for text entry? 
Both the first and second research questions are exploratory in nature while the third question is a causal 
question as the effect that the proposed interaction techniques have on the usability of a word processor will 
be examined. 
 
1.6 Scope 
The possibilities presented by the proposed research study are vast and wide-ranging. Therefore, the scope of
the study must be clearly defined at the outset to avoid scope creep.
Since the multimodal interface is only now being proposed, this study will include both the development and
the testing of the feasibility of the proposed interface. Testing the feasibility will allow a more knowledgeable
sample to evaluate the potential, both short- and long-term, that the interface offers.
Thereafter, the usability of the interface must be investigated through objective, measurable usability metrics. 
Since the user base of a word processor is very diverse and the interface proposes to extend this base even 
further, the population which will be concentrated on must be clearly defined. Since the interface has not yet 
been tested, the scope of the study will include testing on proficient able-bodied users only. This will 
determine whether the interface is usable for the context in which it will be used. 
Although the study has identified three main features of a word processor that will be concentrated upon, it is 
not possible to include testing on all the functionality that a word processor offers. Therefore, the tasks that 
will be included in the testing will represent only a subset of the functionality, but will be chosen based on the 
consideration that they are the most commonly used functions in a word processor environment. 
 
1.7 Limitations of the study 
Keates and Trewin (2005) state that in order to provide interfaces which compensate for disabilities, it is 
necessary first to fully understand the difficulties of the users. This implies that each disability will present its 
own challenges and require unique compensatory actions to be taken. This viewpoint is further supported by 
Gajos, Wobbrock and Weld (2008), who evaluated systems which automatically generated adaptable 
interfaces based on individual motor capabilities of users with motor impairments. Since the proposed 
interface may be an ideal solution for disabled users it would have to be tested using disabled users. 
Unfortunately, the scope of the study will not allow for these tests to be conducted, specifically not in the 
order that they will be required. Therefore, a limitation of the study is that only able-bodied users will be 
tested.  
The initial motivation of the study was to provide an interface which is suitable for both novice and more 
experienced users. However, the nature of a longitudinal study, especially within the context of the hardware 
which is required for this study, together with time and budget constraints, was not conducive to the use of a 
large sample. Therefore, only experienced users will be tested as these will not require additional training on a 
word processor. Other target groups will not be tested and will have to be tested in the future in order to 
determine whether the proposed interface provides a viable solution to all users. 
Dwell time, look-and-shoot and blinking will also be added as interaction techniques for use within the 
developed application. However, although these functionalities will be provided, they cannot all be tested 
during the formal usability testing. Therefore, only the proposed solution of eye gaze and speech for text entry 
will be tested and compared to the traditional means of keyboard and mouse. Furthermore, a limited grammar 
for speech input will be tested which implies that it will not be possible to complete all word processing tasks 
through speech commands. Although this is undoubtedly a limitation of the study, it was felt that within the 
scope of the study it was sufficient to provide speech commands for only the common word processor tasks. 
 
1.8 Methodology 
The thesis is based on the premise of testing the principle behind the inclusion of both speech recognition and 
eye-tracking in a word processor application. To this end, five research questions (questions 1 and 2 and secondary questions 3a to 3c; section 1.5) were posed.
Each of these research questions will be answered in turn using its own specific methodology, each of which 
will be discussed further in this section. 
Research question 1: Can a customisable multimodal interface be developed and successfully incorporated 
into a mainstream word processor with the aim of providing an all-inclusive application to a diverse group of 
users? 
In order to make user interaction with the test system as natural as possible, the system must emulate the 
real-world application as closely as possible. Therefore, a popular word processor application will be chosen as 
the application which must be emulated or changed to incorporate the multimodal interface (Chapter 3). Since 
Microsoft Word® is the most popular word processor in the current market, it was chosen as the application 
on which the study would focus. Moreover, Visual Studio Tools for Office (VSTO) allows programmers to add 
additional functionality and change the interface of applications within the Office Suite. Therefore, using these 
and other tools and software development kits (SDKs) which are available, eye gaze and speech functionality 
will be added to Word. By providing a number of means through which additional modalities can be used, the 
interface can be customised to suit the needs of a particular user at any given time. 
This study will make use of surveys and experiments to resolve the empirical research questions, namely the 
second and third research questions. Surveys, both in the form of questionnaires and interviews, will be used. 
Questionnaires will be used in a number of capacities, such as to capture user demographics and to measure user
opinion of, as well as user satisfaction with, the proposed interface (Appendices A, C - H). Interviews will also be
conducted with test participants in order to gauge their satisfaction, general impressions and comfort level 
with the application. Interviews will allow more open-ended questions to be posed to participants than would 
be the case with questionnaires. Questionnaires will contain some open-ended questions but for the most part 
the questionnaire will follow a structured approach.  
Research question 2: How feasible is such an interface and in which context is it feasible? 
In order to answer this research question, a feasibility study with a carefully selected sample will be conducted 
(Chapter 4). The sample will be a convenience sample and will be drawn exclusively from a population which is 
familiar with the human-computer interaction field. Since the study will be more qualitative in nature a sample 
size of 5 will be sufficient (Nielsen, 2000). The sole data collection method for this feasibility study will be a 
questionnaire with both closed- and open-ended questions.  
This feasibility review will require participants to give an unbiased opinion of a system as their experience 
should allow them to accurately judge the long-term possibilities of a system, should there be no immediate 
short term benefits. This will allow the viability of the chosen interaction techniques to be determined without 
concentrating on usability measures per se. The aim of the feasibility review is to establish a more subjective 
view about whether the interface which is suggested has long-term usage potential and whether it can offer a 
solution that meets the needs of users. 
 
Research question 3: How usable is the multimodal interface compared to the traditional interaction 
techniques? 
Experiments will be used to answer all three secondary research questions.  Usability experiments in human-
computer interaction (HCI) generally take the form of user testing which requires that representative users 
must perform representative tasks on the application (Al-Qaimari & McRostie, 2001; Dillon, 2001; Preece et al., 
1994; Shneiderman, 1998). Therefore, for each of the secondary questions suitable tests will have to be 
designed which will allow the usability of that particular word processing function to be measured (these tests 
will be discussed in Chapter 3). The International Organization for Standardization (ISO) stresses that in order to test the
usability of a product both the performance and satisfaction of the end-users must be measured in some way 
(ISO, 1998). In order to do this, effectiveness, efficiency and satisfaction must be defined in terms of 
measurable attributes (ISO, 1998; Bevan & Macleod, 1994; Scholtz, 2004). Ultimately, this research study has 
adopted the viewpoint that it is obligatory to select at least one measurement for each of the usability 
components of effectiveness, efficiency and satisfaction. The actual objective measurements which will be 
used will be discussed in Chapter 3. Objective measurements will be complemented by questionnaires 
designed to elicit subjective measurements of usability (Appendices E, G and H).  Each of the user tests will 
make use of a convenience sample as the participants will be sourced from the university at which the study is 
being conducted. For the purposes of the user testing an endeavour will be made to maintain a minimum 
sample size of 20 (Nielsen, 2006). 
Research question 3a: How usable is the combination of eye gaze and speech when used to simulate a pointing 
device? 
The accepted means of testing and comparing pointing devices is through the use of the International
Organization for Standardization (ISO) standard 9241-9 (Chapter 5). This test will be used to determine how best
to increase the usability of eye gaze and speech as a pointing device, to the extent that its performance may be
comparable to that of the traditional mouse. The literature review (Chapter 2) will identify possible means
through which usability can be increased. These will be tested and compared to the use of a mouse as a 
pointing device. 
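As an illustration of the kind of measure an ISO 9241-9 comparison yields, the following Python sketch computes throughput in the form commonly reported in the pointing-device literature: the effective index of difficulty, log2(D / We + 1), divided by the mean movement time, where the effective width We is 4.133 times the standard deviation of the selection endpoints. This is a sketch under that assumption only; the function names and input layout are hypothetical, and the measurements actually used in this study are those discussed in Chapters 3 and 5.

```python
import math
import statistics


def effective_width(endpoint_deviations):
    """Effective target width: 4.133 times the standard deviation of the
    selection endpoints, measured along the axis of movement."""
    return 4.133 * statistics.stdev(endpoint_deviations)


def throughput(distance, endpoint_deviations, movement_times):
    """Throughput in bits per second for a single test condition.

    distance            -- distance between targets (same unit as the deviations)
    endpoint_deviations -- signed offsets of each selection from the target centre
    movement_times      -- time taken for each selection, in seconds
    """
    w_e = effective_width(endpoint_deviations)
    id_e = math.log2(distance / w_e + 1)   # effective index of difficulty, in bits
    return id_e / statistics.mean(movement_times)
```

A higher throughput indicates that a technique transfers more information per second, so a value can be computed for the mouse and for each gaze and speech variant and then compared.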
Research question 3b: How usable are speech commands for performing common word processing tasks? 
User testing will be conducted to compare the use of traditional methods to achieve common word processor 
tasks and the use of speech commands (Chapter 6). These common word processor tasks will include such 
functions as selecting text, formatting of text, navigating through a document and manipulating the text in the 
document (for example, cutting and pasting). These tasks will be of such a nature that they can be completed 
without having to click on an icon or menu option in the application. Speech commands will be provided for 
these tasks so that they can be completed without the use of either a mouse or keyboard. A preset list of tasks 
will require study participants to complete tasks using either a mouse or keyboard and then to complete an 
equivalent task using speech commands. Since it may require some time for participants to become 
accustomed to the speech commands a longitudinal study will be undertaken. This will therefore be a 
repeated-measures within-subjects study. Efficiency measurements, such as time to complete a task, and 
effectiveness measurements, such as the level of correctness with which the task can be completed, will be 
measured and analysed. Furthermore, questionnaires will be used to analyse the subjective measurement of 
user satisfaction. 
Research question 3c: How usable is the combination of eye gaze and speech when used for text entry? 
The final research question will be answered using the same method as for the previous research question. 
Within the task list for the longitudinal testing, there will be a number of tasks which will require the 
participant to type random phrases using either the keyboard or eye gaze and speech (Chapter 7). Efficiency 
and effectiveness measurements will be analysed. Once again, questionnaires will be used to test the 
subjective measurement of satisfaction. 
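The kind of efficiency and effectiveness measurement involved in a text entry task can be sketched as follows. The sketch assumes that the error rate is taken as the minimum string distance (the fewest insertions, deletions and substitutions) between the presented and the transcribed phrase, expressed as a percentage of the longer phrase, and that speed is expressed in characters per second. These definitions and function names are assumptions made for illustration; the measurements actually used in this study are those described in Chapters 3 and 7.

```python
def minimum_string_distance(presented, transcribed):
    """Fewest insertions, deletions and substitutions needed to turn the
    presented phrase into the transcribed phrase (Levenshtein distance)."""
    previous = list(range(len(transcribed) + 1))
    for i, p_char in enumerate(presented, start=1):
        current = [i]
        for j, t_char in enumerate(transcribed, start=1):
            cost = 0 if p_char == t_char else 1
            current.append(min(previous[j] + 1,          # character omitted
                               current[j - 1] + 1,       # extra character typed
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def error_rate(presented, transcribed):
    """Error rate as a percentage of the longer of the two phrases."""
    longest = max(len(presented), len(transcribed))
    if longest == 0:
        return 0.0
    return 100.0 * minimum_string_distance(presented, transcribed) / longest


def characters_per_second(transcribed, seconds):
    """Text entry speed for one phrase."""
    return len(transcribed) / seconds
```

For example, transcribing the presented phrase "quick" as "quack" involves a single substitution in five characters, which this sketch reports as an error rate of 20 per cent.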
To round off the exploration of the third research question, subjective satisfaction will be measured using 
established questionnaires (Chapter 8). 
Data analysis will be conducted in order to make insightful conclusions from the data that has been collected. 
For these purposes, descriptive as well as inferential statistical analysis (section 3.5), which will be dependent 
on the data that is collected, will be conducted. 
 
1.9 Outline of the thesis 
This thesis will proceed according to the following outline. Chapter 2 will provide a discussion of some of
the available literature. Motivation will also be provided for the study which was undertaken. This will include
discussions on the technologies which were chosen for inclusion in the study, with their associated 
disadvantages and how these could possibly be overcome. 
Thereafter, Chapter 3 will focus on the experimental methodology and design of the study. Specific details will 
be given of all instruments which will be used or developed in order to explore the research questions. This will 
include the questionnaires which will be used as well as an in-depth discussion of the application which will be 
developed in order to answer the posed research questions. 
Chapter 4 will discuss the results of the feasibility study which was conducted in order to establish the viability 
of the developed multimodal interface. Chapter 5 will report on the user testing which was conducted in order 
to determine how usable the proposed interaction techniques are when used to replace a pointing device.  
The following two chapters (Chapters 6 and 7) will report on the results of the longitudinal user testing which 
was designed to evaluate objective usability measurements for the multimodal interface. This will include the 
comparative analysis with the more traditional means of interaction currently available for a word processor. 
Chapter 8 will then discuss the subjective feelings of the test participants towards the proposed multimodal 
interface. A number of anecdotal observations will also be reported on. 
The final chapter (Chapter 9) will provide a summary of the results found as well as make some 
recommendations for use and further research. 
 
1.10 Summary 
This chapter provided a brief introduction to the study which was undertaken. The motivation for undertaking 
the study stemmed from a number of sources and provided an opportunity for a wide-reaching study with 
broad scope. The scope was, however, narrowed down to a manageable size which sufficed for the purposes 
of the thesis. A number of limitations were identified which have to be considered during the course of the 
study. Finally, the methodology which will be used to answer the research questions was presented and briefly 
discussed. 
The following chapter will provide a more in-depth discussion of some of the available literature which 
provided the basis and motivation for the research study. 
  
 
 
CHAPTER 2 
THEORETICAL BACKGROUND 
 
2.1 Introduction 
The previous chapter gave an overview of the objectives, motivation and methodology which will be used to 
answer the research questions that were posed. This chapter will discuss some of the relevant literature which 
formed the foundation for this study. Various concepts pertinent to the study will be defined and their use 
explained. These include discussions on concepts such as usability, user interfaces in general and computer 
users. 
Previous studies which are related to the current study will be reported on. In particular, the focus will be on 
the modalities of speech and eye gaze. In order to facilitate this discussion, the human physiology behind 
these technologies must be discussed. Following this, the specific technologies of speech recognition and eye-
tracking will be discussed with reference to relevant studies that have used them as interaction techniques. 
Thereafter, the combination of the two within a multimodal interface will be reported on with specific 
reference to how it can be used for text entry and as a pointing device. 
 
2.2 Word processors 
“Word processing, a concept that combines the dictating and typing functions into a centralized system, 
is replacing the one-man, one-secretary, one-typewriter idea in a growing number of firms. By 
organizing the flow of office correspondence on a more efficient basis, word processing is becoming to 
typing what Henry Ford’s assembly line was to the original methods used for automobile making.” 
(Administrative Management Article, December 1970 as cited in Haigh, 2006, p. 8)
Word processing is a system which allows for the flexible composition, editing, formatting, storage and 
printing of digital documents (Daintith & Wright, 2008) and is often regarded as the first step towards office 
automation (Freedman, 1998). A word processor is, therefore, the software that provides these capabilities on 
a computer (Freedman, 1998). 
The word processor application has evolved substantially since its inception. The original word processor 
- in the true sense of the word - was developed by IBM in 1969 and was known as the Magnetic Tape Selectric 
Typewriter or MT/ST (Eisenberg, 1992). In this model, keystrokes were recorded on a 16 mm magnetic tape 
and, while the MT/ST was capable of distinguishing between words, lines and paragraphs, the division of the 
full text into pages and the numbering of pages still had to be manually completed by a human operator 
(Eisenberg, 1992). Since then the word processor has undergone a virtual metamorphosis to achieve the 
capabilities that are available in these applications today. The introduction of MS-DOS yielded great 
improvement in the capabilities of word processors with the inclusion of features such as endnotes, footnotes 
and the ability to edit more than one document by utilising the provision of increased memory and disk space 
(Eisenberg, 1992). The introduction of WordStar in 1979 saw the first release of a “what you see is what you 
get” (WYSIWYG) word processor (Bergin, 2006a). Its developers touted WordStar as being the first word 
processor that was capable of showing onscreen page breaks, that had in-line help, was keystroke sensitive, 
had automatic word wrap and allowed users to set the left and right margins (Bergin, 2006a). When Microsoft 
Windows replaced MS-DOS, Microsoft Word became the word processor of choice (Bergin, 2006a; Bergin, 
2006b).  
Chapter 2 Theoretical Background 
Two trends in the widespread adoption of the word processor are notable. Firstly, when word processing 
became synonymous with a computerised application, this niche in the technology field became the fastest 
growing and most competitive of the field (Haigh, 2006). Secondly, the falling cost associated with such 
technology facilitated the widespread adoption of these tools in business arenas that otherwise might not 
have been possible (Haigh, 2006).  
One drawback of the current word processors in circulation is that they depend heavily on the user’s ability to 
read without impediments and to be able to remember and execute a sequence of actions to perform a 
desired action (Dickinson, Gregor & Dickinson, 2003) – traits which not all word processor users possess in 
equal measure. For example, SeeWord was developed for use by dyslexic users and found to be far more 
suitable for this audience than a WYSIWYG word processor (Dickinson et al., 2003). Since it is evident that the 
word processor is constantly evolving to adapt to the needs of users and to exploit the increased capabilities 
offered by the newer technologies, it offers a unique environment and one rich in potential for improvement 
of the user experience, particularly since the current word processor may assume that its users possess certain 
abilities. Furthermore, the adoption of the word processor by a large group of users as it becomes affordable 
bodes well for the adoption of other technologies which may currently be beyond the budget of mainstream 
users (such as eye-tracking). Should such a feat be emulated and a new application become known for its 
usability and customisability, it may come to enjoy adoption as widespread as that of the traditional word 
processor. For all these reasons, the word processor and the improvement of its usability are the main 
focus areas of this research study. In particular, the aim of the study is to develop a multimodal interface for a 
mainstream word processor and to establish its usability. 
 
2.3 Usability and user experience 
There are many definitions available for usability (cf. Shackel, 1991; Shneiderman, 1998; Wixon & Wilson, 
1997). The International Organization for Standardization (ISO) formalised the definition of usability in the 9126-1 
standard as “the capability of the software product to be understood, learned, used and attractive to the user, 
when used under specified conditions”. This definition is further expanded upon in ISO 9241-11 where 
usability is defined as “the extent to which a product can be used by specified users to achieve specified goals 
with effectiveness, efficiency and satisfaction in a specified context of use” (ISO, 1998).  
These ISO definitions were considered appropriate for the purposes of the current study and were combined 
into a single definition which encompasses all the salient parts, namely: 
The usability of the system can be measured as the extent to which a software product can 
be used to achieve specified goals with effectiveness, efficiency and satisfaction as well as 
the extent to which it exhibits the capability to be learned and understood by the user. 
The definitions allow measurable components to be extracted in order to determine the usability of a product 
by requiring users to complete certain tasks on the system in question. The four identified components can be 
defined as follows: 
• Effectiveness is how well the user is able to achieve that which must be done by using the system 
(ISO, 1998) and can be measured in terms of accuracy and completeness (Cato, 2001). 
• Efficiency is the amount of resources required to complete the desired task (ISO, 1998), such as time, 
money or mental effort (Bevan & Macleod, 1994). 
• Satisfaction is a subjective feeling and relates to the attitude of the user towards the system (ISO, 
1998). 
• Learnability measures not only the time taken for a user to become familiarised with the system but 
also how well the user is able to remember system functionality (Cato, 2001). 
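The four components can be made concrete with a small sketch. The Python fragment below is illustrative only: the record fields (`session`, `completed`, `errors`, `seconds`, `rating`) and the chosen formulas (completion rate, mean time on task, mean rating and the relative reduction in time between the first and last session) are assumptions made for the example and are not the measurements used in this study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    """One participant's attempt at one task (hypothetical log format)."""
    session: int      # 1 = first exposure, higher = later sessions
    completed: bool   # was the task finished successfully?
    errors: int       # number of errors made during the attempt
    seconds: float    # time on task
    rating: int       # self-reported satisfaction, e.g. 1 (low) to 5 (high)


def effectiveness(records):
    """Share of attempts completed successfully (completeness)."""
    return sum(r.completed for r in records) / len(records)


def efficiency(records):
    """Mean time on task, in seconds, over the completed attempts."""
    return mean(r.seconds for r in records if r.completed)


def satisfaction(records):
    """Mean self-reported rating over all attempts."""
    return mean(r.rating for r in records)


def learnability(records):
    """Relative reduction in mean time between first and last session."""
    first = min(r.session for r in records)
    last = max(r.session for r in records)
    t_first = mean(r.seconds for r in records if r.session == first)
    t_last = mean(r.seconds for r in records if r.session == last)
    return (t_first - t_last) / t_first
```

Other operationalisations are equally defensible, for instance using the error count for accuracy or a recall test for the retention aspect of learnability; the sketch merely shows that each component reduces to a quantity that can be computed from logged task data.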
Each of these components can be measured in some way depending on the task at hand and which part of the 
system is being tested. For the purposes of determining usability in the current study, certain measurements, 
all conforming to the above-mentioned components of usability, will be recorded and analysed. Since usability 
encompasses all of these components, where possible a measure of each of these will be used to provide a 
representative view of the usability of the inspected interface feature. The actual measurements which will be 
analysed for each part of the study will be discussed under the relevant sections in Chapter 3. 
In recent years, there has been a movement towards evaluating the user experience and not simply evaluating 
usability. Similar to usability, user experience has a number of definitions ranging from being synonymous with 
usability, to encompassing beauty, affective or experiential aspects of using technology (Hassenzahl & 
Tractinsky, 2006). Some texts consider user experience to be a broader field than usability and represent it as 
the convergence of usability, branding, functionality and concept (Rubinoff, nd) or “the creation and 
synchronisation of the elements that affect users’ experience with a particular company, with the intent of 
influencing their perceptions and behaviour” (Unger, 2009). 
Essentially, the user experience can be summarised as “a consequence of a user’s internal state, ... the 
characteristics of the designed system ... and the context within which the interaction occurs” (Hassenzahl & 
Tractinsky, 2006) or “the characterisation of what a user feels while using a product” (Paluch, 2009). ISO 9241-
210 defines the user experience as a very subjective concept in terms of the person’s perceptions and 
responses while using the system (ISO, 2010). 
From all these definitions it is clear that the user experience can be interpreted as being much broader than 
the definition of usability which has been accepted for use in this study. This sentiment was echoed in the 
definitive guide on user experience, where Tullis & Albert (2008) reiterate that usability and the user 
experience are two separate concepts with user experience including aspects such as the thoughts, feelings 
and perceptions that result from interaction with the product. Nevertheless, they use the terms usability and user 
experience interchangeably and advocate the use of usability metrics which measure some aspect of the user experience. 
To this end, efficiency, effectiveness and satisfaction must be measured together with expectations, ease-of-
use, awareness and behavioural and physiological metrics such as eye-tracking, facial expressions and 
measures of stress (Tullis & Albert, 2008). 
While this study will test the usability of the proposed application it will inspect aspects beyond the formal 
usability metrics which have been defined. Interviews will be conducted in an attempt to elicit other self-
reported responses to the application which do not generally fall within the confines of usability. Additionally, 
a smaller study will be conducted to test the feasibility of, and user reaction to, the proposed system. A large 
portion of the following chapter will contain a discussion of the characteristics of the system, how it was 
developed and the effort made to create a full-scale, highly customisable application which caters for a 
large, diverse group of users. Therefore, the term user experience will be used to refer to the all-encompassing 
study which includes all the aforementioned aspects. 
 
2.4 User interfaces 
The user interface is the conduit between the user and the computer and as such plays a vital role in the 
success or failure of an application. A graphical user interface (GUI) is an interface that makes use of input 
devices other than the keyboard (Daintith & Wright, 2008). GUIs usually make use of windows, icons, menus 
and pointing devices, and are therefore often referred to as WIMP interfaces. Currently, the user interface 
finds itself in the dubious situation of being in a design rut, with the current trend of interfaces having been 
around since the 1970s (Van Dam, 2001). Some designers feel that the time has arrived to concentrate on 
discovering innovative post-WIMP interfaces which do not rely solely on menus and icons (Van Dam, 2001). A 
natural consequence of this may be that the mouse and keyboard will no longer be suitable input methods. 
Instead, indications are that future computer interfaces should foster natural and intuitive communication that 
emulates human-human communication (Wachs, Kölsch, Stern & Edan, 2011). Since users are often notorious 
for their unwillingness to accept changes in their interaction methods, these interfaces must be accessible and 
not require long periods of learning (Wachs et al., 2011). One of the major challenges facing the HCI 
community at present is the development of these alternative means of input which move away from the 
traditional manual inputs of mouse and keyboard (Miniotas, Špakov, Tugoy & MacKenzie, 2006). This may well 
be the area in which perceptual, attentive, non-command, brain-computer and multimodal user interfaces find 
themselves at the forefront of the technology wave.  
 
2.4.1 Perceptual, attentive and non-command user interfaces 
Perceptual user interfaces coordinate perception using multimodal input and multimedia output modelled 
after natural human-computer interactions (Maglio, Matlock, Campbell, Zhai & Smith, 2000), aimed at allowing 
users to interact with technology in much the same way as they interact with each other (Turk, 2001). 
Perceptual interfaces are interactive and utilise the senses to provide interactions which cannot be 
accomplished through traditional input devices (Turk & Kölsch, 2004). An example of a perceptual interface is 
the Kid’s Room, a narrative play-space for children with two walls constructed from video projection screens to 
allow the room to be transformed into a magical play area (Bobick et al., 1999). For example, a river world is 
created where children are encouraged to use the bed in the room as a boat and to row it down the river. 
Should any person exit the boat a splashing sound is heard and the person is encouraged to climb back into the 
boat.  
Attentive user interfaces go a step further than perceptual interfaces since they must not only perceive, but 
also anticipate the user’s next action (Maglio et al., 2000).  For example, in a multi-monitor set-up, the 
attentive interface developed by Ashdown and Sato (2005) automatically moves the mouse cursor to the 
monitor the user was looking at and causes the topmost window on that monitor to receive focus. This would 
undoubtedly be an attractive solution to any user of multiple screens and windows.  
Jakob Nielsen (1993) coined the term non-command interfaces to describe the future generation of user 
interfaces which would infer meaning without having to receive explicit commands from the user. These 
interfaces seem to be very similar to attentive user interfaces but also lean heavily on the theory of ubiquitous 
computing which allows the interfaces to be embedded into the user’s physical environment. Virtual reality is 
an example of a non-command interface as it allows users to immerse themselves in a simulated world and 
move about and interact with the virtual world in the same way as they would in the real world. 
The fact that these three types of interfaces utilise human senses or strive to react to human attention means 
that current input devices will not be sufficient (cf. Dirican & Göktürk, 2009; Turk & Kölsch, 2004). Instead, 
technologies which allow human senses to be mimicked or reacted to (Turk & Kölsch, 2004), such as speech 
recognition or olfactory devices, are required. Furthermore, eye gaze position (or the point of regard) is 
possibly the simplest means by which to infer the focus of users’ attention (Just & Carpenter, 1976). For these 
purposes eye-tracking, which will be discussed in a later section, can be used. While perceptual, attentive and 
non-command user interfaces are an exciting and promising new field of research they are beyond the scope 
of the current study. The technologies used in this study are, however, ideal for these interfaces and 
cognisance will be taken of this throughout the study as the potential of the proposed interface stretches far 
beyond that which will be investigated.  
 
2.4.2 Brain-computer user interfaces 
A brain-computer user interface (BCI) is a computer interface that is able to respond to human thoughts and 
intentions (Nijholt & Tan, 2008). Although the main focus of BCIs is to enable disabled users, BCIs may become 
equally acceptable for use within the able-bodied community, in the form of gaming interaction or map 
navigation (Nijholt & Tan, 2008). Although many may shy away from the use of such a seemingly complex 
system, the fact remains that it presents certain advantages above and beyond traditional interaction devices 
– particularly in environments where the user’s hands are busy and additional interaction devices are required 
to increase productivity, or in systems where bandwidth is insufficient, such as some gaming environments 
(Nijholt & Tan, 2008). Similar to the way in which BCIs can be justified for use within the mainstream computer 
population, so too can eye-tracking and speech recognition. 
BCIs are beyond the scope of the current study but remain an exciting prospect for the future of user 
interfaces. 
 
2.4.3 Multimodal user interfaces 
In order to define a multimodal interface it is necessary first to define a modality. A modality is a means of 
communication using one of the five human senses or a type of computer device that is equivalent to the human 
senses (Jaimes & Sebe, 2005) or the way an action is performed (Coutaz & Caelen, 1991). Similar to usability, 
there are a number of definitions available for multimodal interfaces, some of which are listed below: 
• A multimodal interface uses a combination of communication means using the human senses or 
equivalents thereof, thereby responding to multiple input channels (Jaimes & Sebe, 2005). 
• A computer system is said to be multimodal if “it supports human modalities such as gesture, written 
or spoken natural language” (Coutaz & Caelen, 1991). 
• Multimodal interfaces “process two or more combined user input modes – such as speech, pen, 
touch, manual gestures, gaze and head and body movements – in a coordinated manner with 
multimedia system output” (Oviatt, 1999). 
• A multimodal interface is one in which several input and output modalities are combined in an effort 
to assist human-computer communication through utilising natural human communication channels 
(Pireddu, 2007). 
• The aim of a multimodal interface is to make a computer behave in a fashion similar to human 
communication which should facilitate easier learning and use (Kaukènas, Navickas & Telksnys, 2006). 
There are common elements to these definitions, for example that the purpose is to make interaction more 
natural through the emulation of human-human communications. For the purposes of this study, a 
combination of the definitions will be used to define a multimodal interface. The following definition therefore 
applies: 
A multimodal interface uses several human modalities which are combined in an effort to 
make human-computer interaction easier to use and learn by using characteristics of 
human-human communication. 
Multimodal interfaces themselves date back to 1980, when Richard Bolt, in his seminal work entitled “Put-that-
there” (Bolt, 1980), combined speech and gestures to select and manipulate objects. Using a projected image of 
a workspace, a media room was used to create the impression of a virtual workspace as opposed to simply 
working on a computer. Speech recognition provided a grammar through which commands could be issued to 
create, move and change onscreen elements. Gestures were interpreted so that elements could be positioned 
and moved from one location to another. Commands such as “Create a blue triangle here” would cause a blue 
triangle to be drawn where the user was pointing. Similarly, saying “Move this there” would move an element 
from its position, indicated by pointing at it, to the new location the user was pointing at. 
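The fusion of speech and pointing described above can be sketched as follows. This is not Bolt's implementation: the event formats, the list of deictic words and the rule of taking the pointing sample nearest in time to each spoken word are all assumptions made purely for illustration.

```python
from bisect import bisect_left

# Words whose meaning depends on where the user is pointing (assumed list).
DEICTIC_WORDS = {"here", "there", "this", "that"}


def pointed_location(pointing, t):
    """Return the pointing sample closest in time to t.

    `pointing` is a non-empty list of (timestamp, (x, y)) tuples sorted by
    timestamp.
    """
    times = [ts for ts, _ in pointing]
    i = bisect_left(times, t)
    # Only the samples just before and just after t can be the nearest one.
    candidates = pointing[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda sample: abs(sample[0] - t))[1]


def resolve(words, pointing):
    """Pair every deictic word with where the user pointed when saying it.

    `words` is a list of (timestamp, word) tuples, as a speech recogniser
    might produce. Non-deictic words are paired with None.
    """
    return [
        (word, pointed_location(pointing, ts) if word in DEICTIC_WORDS else None)
        for ts, word in words
    ]
```

For the utterance “Move this there”, such a routine would attach one screen position to “this” and another to “there”, after which the command can be executed on the element found at the first position.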
The following year, using the same media room with a projected workspace, Bolt was responsible for the first 
gaze-controlled interface when he tested an application called World of Windows (WOW) (Bolt, 1981). WOW 
was essentially a gaze-controlled interface which allowed multiple windows to be displayed simultaneously to 
a user. Based on the user’s gaze, a window could be “zoomed into” by concentrating on it long enough 
(analogous to dwell time) or through an additional action such as a spoken command. The currently zoomed 
window could be reset by either looking away from the workspace or gazing at another window in the display.  
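The dwell-based zoom described above can be illustrated with a short sketch. It is not a reconstruction of WOW: the sample format, the function name `dwell_selection` and the threshold values are assumptions made purely for illustration.

```python
def dwell_selection(samples, threshold):
    """Return the first window gazed at continuously for `threshold` seconds.

    `samples` is a time-ordered list of (timestamp, window_id) pairs, where
    window_id is None when the gaze falls outside every window. Returns None
    if no window is looked at for long enough.
    """
    current, since = None, None
    for ts, window in samples:
        if window != current:
            # Gaze moved to a different window (or off the display):
            # restart the dwell timer.
            current, since = window, ts
        elif window is not None and ts - since >= threshold:
            return window
    return None
```

A short threshold makes selection fast but invites unintended zooms whenever the user merely glances at a window, whereas a long threshold is safer but slower; a spoken command, as WOW also allowed, avoids this trade-off by making the trigger explicit.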
A distinct advantage of multimodal interfaces is that they offer the possibility of making interaction more 
natural (Bernhaupt, Palanque, Winkler & Navarre, 2007). Furthermore, a multimodal interface has the 
potential to span across a diverse user group, including varying skill levels, different age groups as well as 
increasing accessibility for disabled users whilst still providing a natural, intuitive and pleasant experience for 
able-bodied users (Oviatt & Cohen, 2000).  This statement played a significant role in the motivation to 
undertake this study. Current interfaces are not capable of living up to these expectations and a multimodal 
interface that meets these needs must still be discovered. 
Therefore, this study will propose a multimodal interface for a word processor application which makes use of 
natural human modalities in an attempt to provide a hands-free, intuitive, easy to use and learn interface 
which can cater for a diverse group of users. In this regard, speech offers an intuitive means of communication 
which requires very little to no training in its application within a computer interface. Coupled with speech, 
humans often make use of their hands, body language and eye gaze to convey meaning and intent alongside the 
spoken words. These human qualities offer a wide variety of possibilities in this new generation of interfaces 
and this provides ample motivation for the investigation into a multimodal interface using speech and eye 
gaze, which forms the basis of the suggested multimodal interface in this study. 
 
2.4.4  Interaction techniques 
Using a physical input device in order to communicate or perform a task in human-computer dialogue is called 
an interaction technique (Foley, Van Dam, Feiner & Hughes, 1990 as cited in Jacob, 1995a). However, for the 
purposes of this study, the definition will be modified and used in the following context: 
An interaction technique is the use of any means of communication in a human-computer 
dialogue to issue instructions or infer meaning. 
The proposed multimodal interface will provide a number of interaction techniques to heighten the 
customisability it offers to the user. In this way, it may be possible to cater for a very diverse group of users as 
previously stated. Interaction techniques may oftentimes be a single modality but the term will also be used to 
refer to a number of modalities which are combined into a single interaction technique. 
 
2.5 Computer users 
Users are those people who will eventually interact with the product or application (ISO, 1998). The profile of 
computer users has changed from the inventors themselves (first generation), to the technocrats and 
computer professionals (third generation), to include everybody in the current and future generation of user 
interfaces (Nielsen, 1993). This has a direct consequence in that applications must cater for a large and diverse 
user base in order to ensure the continued use of the application. One of the primary tasks which must be 
completed in any graphical user interface is to position a cursor over an object and select that object (Keates & 
Trewin, 2005). This action can prove difficult for older users and users with disabilities (Keates & Trewin, 2005). 
Some of the different types of computer users, including aged and disabled users, will be concisely discussed in 
the following sections. 
 
2.5.1 Types of users 
Any computer user, regardless of physical abilities or age, can be classified according to level of expertise. The 
level of expertise is measured in terms of experience with both the task and the interface domain 
(Shneiderman, 1998). Novice users have little knowledge of either the task or the interface concepts while 
first-time users have sufficient knowledge of the task concepts but limited knowledge of the interface concepts 
(Shneiderman, 1998). Knowledgeable intermittent users have knowledge of both the task and the interface 
domain but their infrequent use of the interface prevents adequate retention of the interface components 
(Shneiderman, 1998). The final category of user is the expert user who displays extreme competence in both 
domains (Shneiderman, 1998). 
Originally, the aim of the study was to test as many of the user categories as possible on the multimodal 
interface using eye gaze and speech. However, it quickly became clear that the scope would be far too large; 
therefore the target group was limited to expert word processor users so that no training on word processor 
use would be needed. Since the proposed multimodal interface was a new concept, there could be no users of 
any expertise level for the interface. There could, however, be users of the individual modalities of eye gaze 
and speech. Nevertheless, the exclusivity of the modalities and the cost of the high quality equipment needed for 
them, especially eye-tracking, reduced the chances of finding a large enough sample. Therefore, it 
was decided to focus instead on first-time and novice users of both of these modalities. 
 
2.5.2 Aged users 
The aging IT population could have severe consequences for the design of user interfaces. There are several 
factors unique to the aging demographic that must be accounted for in user interface design. These users 
experience a reduction in light sensitivity, colour perception, dynamic and static visual acuity and contrast 
sensitivity (Murata, 2006). Furthermore, there is a decrease in sensory and motor function (Thomas, Basson & 
Gardner-Bonneau, 2008) which impacts on the use of the mouse by increasing the pointing time (Murata, 
2006). Aging users will play a pivotal role in the development of assistive technologies as indicated by the 
projected statistics on the age of computer users. In the United States, 1 in 5 workers in 2020 will be over the 
age of 55 – an increase of over 50% from the year 2000, when only 13% of the workforce fell in that category 
(Thomas et al., 2008). The same phenomenon is evident in Asia Pacific and Europe where 20% of the Japanese 
and Italian population were 65 or older in 2006 (Thomas et al., 2008). 
A study using a sample that was representative in terms of age and gender showed that, when positioning the 
cursor, older users take longer and pause more often than younger users do (Keates & Trewin, 2005). Murata 
(2006) conducted an experiment to determine the impact of aging on mouse handling amongst three age groups, 
namely, young, middle-aged and older. All groups consisted exclusively of men, all of whom had experience 
with mouse handling but none of whom had ever used an eye-tracker. Participants were required to point to a 
predefined target on screen first using a standard mouse and then using the eye-tracker. In terms of just 
mouse pointing time, it was found that older adults required more time to point with a mouse than younger 
adults, which indicates a lengthening of manual input time required as age increases. All age groups performed 
better when using eye gaze to select targets with a marked decrease in the difference between the groups 
when using eye gaze. Results show that eye gaze pointing can be mastered and performed at high speed 
regardless of the user’s age, and is a recommended pointing device for the older computer generation. 
Younger adults preferred pointing with a mouse while middle-aged and older users found the eye gaze 
pointing to be very easy. A limitation of Murata’s study is, of course, that only one gender was tested and the 
results were not verified with females as well. Nevertheless, it is acknowledged that aging users have 
characteristics that must be considered and which are unique to this user demographic but which can possibly 
be overcome through the use of non-manual interaction techniques.   
While aged users will not be tested in the current study, this discussion is relevant in the development of a 
multimodal interface. While it has been theorised that eye gaze might not be suitable for aging users due to 
the natural effects of aging on the eye, results from previous studies fail to verify this theory. Therefore, eye 
gaze remains a possible means of interaction for older users and may still be a more usable interaction 
technique than the mouse. Similarly, studies of speech-enhanced user interfaces with older users have yielded 
mixed results (Basson, Fairweather & Hanson, 2007) and further investigation is required. The combination of 
the two modalities has, as far as can be ascertained, not been tested with this demographic. Therefore, this 
group will be taken into account when designing a multimodal interface and the importance of testing these 
users on the proposed interface is recognised and suggested for future research based on the findings of the 
current study.  
 
2.5.3 Disabled users 
A computer can play a vital role in the everyday lives of users with functional impairments (Keates, Hwang, 
Langdon, Clarkson & Robinson, 2002). However, the effective use of a physical input device is heavily 
dependent on the ability of the user to feel, or have the knowledge that they are touching or holding the 
device as well as being able to move, manipulate and operate the device (Bates, 2002). Regrettably, traditional 
input devices, such as the mouse and keyboard, are designed with able-bodied users in mind. Consequently, 
disabled users cannot always manage to use these devices to their full potential or even use them at all (Su, Su 
& Chen, 2005). The most sensible way of empowering disabled users is to provide them with means to be able 
to use the same software applications as any other computer user, which requires that input devices 
specifically tailored for these users will have to be developed (Istance et al., 1996). Therefore, it is imperative 
that alternative means of interaction be found for these users as this will allow them to use technology on a 
level comparable with able-bodied users. 
Although no official statistics are available for South Africa, it is estimated that between 400 and 500 spinal 
cord injuries are sustained annually (Quadriplegic Association of South Africa, nd). In the United States of 
America, this figure is considerably higher, with 11 000 injuries sustained every year and a total of 250 000 
spinal cord injured Americans (www.sci-info-pages.com). Eye movement is ideal for such situations as learning 
time may be reduced through the use of a “natural” means of pointing (Istance et al., 1996). Furthermore, eye 
movement is high-speed (Istance et al., 1996) and the majority of motor impaired individuals still retain 
oculomotor abilities (Hornof, Cavender & Hoselton, 2004). However, the disadvantages associated with eye-
tracking as an input device mean that it should be used with caution or, as suggested by Istance et al. (1996), it 
should ideally be combined with other input modalities which will provide a means to overcome the 
limitations of eye-tracking, such as speech. For example, when using eye gaze as a pointing device, speech can 
be used as a triggering event instead of just using eye gaze. Speech can also offer a means of interaction to 
these users in terms of text input, as in the case of Speech Dasher (section 2.9.4). However, there may be 
instances where the vocabulary of the user is limited and dictation is not possible. This study proposes a 
solution to the problem by allowing text input using eye gaze and speech. This combination may prove to be 
more usable than using eye gaze in isolation and may offer a means of text input for those users who are 
incapable of using speech recognition to its fullest potential. 
It would be very beneficial to test the proposed interface using disabled users since they could very well be the 
user group which will benefit the most from such an application. However, the fact that disabilities can be 
wide-ranging and unique to an individual makes it very difficult to generalise findings to a wider population 
and analysis will have to be performed carefully on such small samples. Therefore, in order to increase the 
applicability of the findings, disabled users were not included in the current study but remain a prospect for 
future research. The possible needs of these users will be considered throughout and the decisions made will 
reflect this. For example, the fact that these users may have restricted mobility or limited vocabularies will be 
a consideration throughout.  
 
2.6 Human modalities 
As established in a previous section (section 1.3), the current study will focus on the development and testing 
of a multimodal interface which uses eye gaze and speech. In order to understand the technologies which 
make the use of these human modalities possible, it is necessary first to understand the human physiology 
behind the human vocal and vision systems. The subsequent sections will briefly discuss these. 
 
2.6.1 Human vocal system 
When humans speak (Figure 2.1), air is forced from their lungs through their mouths and nasal cavities and 
then changed by the lips and tongue (Forsberg, 2003). The air exhaled from the lungs during speech causes 
oscillations of the vocal cords which are situated in the larynx (Fitch, 2000). Acoustic energy is produced and 
filtered (Fitch, 2000) to eventually create discernible sounds.  
Since speech is the most common form of communication between humans its incorporation into user 
interfaces offers the possibility of a more natural human-computer interaction. When humans speak, they do 
so expressively, which means they use their eyes, hands and body to convey more meaning than with just 
simple spoken words. Therefore, the combination of speech with one of these expressions would seem to be a 
most natural means of communication. For the purposes of this study, speech was coupled with eye gaze. 
 
 
 
Figure 2.1: Cross-section view of human vocal system 
Source: www.msu.edu 
2.6.2 Human vision system 
2.6.2.1 Physiology of the eye 
The eye (Figure 2.2) is an organ which is responsible for collecting light and sending it to the brain to be 
processed into images (Yale Medical Group, nd). The outer layer of the eye consists of the anterior transparent 
cornea and the posterior sclera which is a dense, opaque, fibrous tissue (Atchinson & Smith, 2000). The cornea 
is responsible for the most refraction of light while the lens is responsible for accommodation, which is 
achieved by changing the shape of the lens as required (Gregory, 1966). The iris is situated on the middle layer 
of the eye (Atchinson & Smith, 2000). Pigmentation is present in the iris, giving humans the colouring of their 
eyes (Gregory, 1966). The inner layer of the eye is the retina which is connected to the brain via the optic 
nerve (Atchinson & Smith, 2000). The pupil is a tiny hole formed by the iris through which light passes to reach 
the lens and then fall onto the retina to form an image (Gregory, 1966).  
The visual field can be divided into three areas, namely the foveal, parafoveal and peripheral areas (Rayner, 
1998). The central region of the retina, called the fovea, is very densely packed with receptors (Gregory, 1966) 
and in order to see an object clearly, the eye is moved, using the six oculomotor muscles, so that the fovea, 
which has the highest visual acuity, is placed on the object of interest (Rayner, 1998). Objects that are large 
enough can be accurately identified in the peripheral vision (Rayner, 1998). 
 
2.6.2.2 Eye movements 
There are various types of eye movements in a human, some of which will briefly be discussed in this section.  
 
 
 
 
Figure 2.2: Physiology of the eye 
Source: Yale Medical Group (nd) 
Eye movements are required to locate a stationary object and are essentially a series of rapid jerks, known as 
saccades (Gregory, 1966). Visual sensitivity is reduced during saccades (Rayner, 1998). Apart from saccadic 
movement, there are three different eye movements which can be identified. They are (i) pursuit, which 
occurs when the eye follows a moving target, (ii) vergence, which occurs when the eyes are both moved 
inward in order to fixate upon an object and finally (iii) vestibular movements, which are compensatory 
movements made in response to head and body movements to maintain visual direction (Rayner, 1998). Since 
these are all rapid eye movements, they are not of significance in this study and therefore, will not be 
discussed any further or taken into account during the experimental design or analysis of the collected data. 
Between saccades, the eyes experience a period in which they remain relatively still (Rayner, 1998). These 
periods of stability are called fixations and generally last between 200 and 300 milliseconds (Rayner, 1998). A 
fixation occurs when an individual attempts to maintain their eye gaze on a stationary point (Ditchburn & 
Ginsborg, 1953), which can be regarded as focusing of attention on a specific object. During a fixation three 
different eye movements are present namely, tremor, drifts and microsaccades (Martinez-Conde & Macknik, 
2008). These are collectively referred to as fixational eye movements. 
The purpose of tremor, also known as nystagmus, is unclear but it may be responsible for assisting the nerve 
cells of the retina to keep firing in order to ensure perceptual acuity (Rayner, 1998). Drifts occur 
simultaneously with tremor and are slow motions of the eye, possibly used in the absence of microsaccades to 
maintain accurate visual fixation (Martinez-Conde, Macknik & Hubel, 2004). Due to imperfect control of the 
oculomotor system, the eyes sporadically experience small drifting movements away from the fixated target 
and then a microsaccade occurs to compensate for this drift and to move the eyes back to where they were 
(Rayner, 1998).  
There are also three types of overshoot in saccadic eye movement, namely dynamic overshoot, glissadic 
overshoot and static overshoot (Bahill & Clark, 1975). A static overshoot is corrected using a corrective saccade 
but when the eye stops short of the intended target it tends to drift to its final position (Bahill & Clark, 1975). 
These slow drifts are called glissades (Bahill & Clark, 1975). 
The focus of this study will be to use eye gaze, specifically in the capacity of fixations to indicate intention, as 
an interaction technique within a multimodal interface. Therefore, it will be necessary to determine how 
fixations can be used. These methods will be discussed in section 2.8.3. When using eye gaze for interactive 
purposes, fixational eye movements play a role in the stability of the eye gaze pointer. Since eye gaze will be 
used as a pointer in the current study, it is necessary to determine how the accuracy and stability of eye 
pointing can be improved. A more in-depth discussion of these techniques will follow in section 2.8.4.2.2. 
 
2.6.3 Temporal relationship between eye gaze and speech 
When engaged with objects, the eyes tend to look directly at the objects but the fixation which provides the 
information required to interact with the object occurs prior to the action (Land & Tatler, 2009). 
Psycholinguistic studies have also shown that there is a temporal relationship between eye gaze and speech 
(cf. Just & Carpenter, 1976; Tanenhaus, Spivey-Knowlton, Eberhard & Sedivy, 1995), often referred to as the 
eye-voice span. The eyes move to an object before the object is mentioned (Griffin & Bock, 2000) with an 
approximate interval of 500 milliseconds between the eye movement and speech (Velichkovsky, Springer & 
Pomplin, 1997 as cited in Kammerer, Scheiter & Beinhauer, 2008). However, recently it has been shown that 
these fixations on objects of interest could occur anywhere from the start of a verbal reference to 1500 
milliseconds prior to the reference (Prasov, Chai & Jeong, 2007). While the relationship between eye gaze and 
speech could be confirmed in a separate study, a large variance in the temporal difference between a fixation 
and a spoken reference to an object was also found (Liu, Chai & Jin, 2007) which could explain the various 
temporal differences reported on in different texts. In a situation where multiple objects must be referred to 
in a single verbal utterance, the next object is already being fixated upon while speech is being produced for a 
certain object (Griffin, 2001).  
Eye gaze has been successful in resolving ambiguities when using speech input (Tanaka, 1999). However, when 
implementing systems which use both eye gaze and speech, it is important to respond to the input channels by 
correctly identifying how to synchronise the two. It has been found that for the majority of verbal requests, 
users were looking at the object of interest when the command was issued. For the remainder of the 
instances, users tended to look at the object more before the request was issued than after the request was 
issued (Maglio et al., 2000). More specifically, where eye gaze and speech were combined in an interface it 
was found that input events will generally occur within a range of 60-100 milliseconds of one another (Kaur et 
al., 2003). 
Since eye gaze and speech will be used as an interaction technique in the current study, the temporal 
relationship between the two modalities is of relevance as it may impact the usability of the interaction 
technique. In order to maximise the disambiguation of both modalities, the user will be expected to maintain 
eye gaze on the desired object whilst issuing the verbal command to interact with that object. 
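
As an illustration of how the two input channels might be synchronised, the sketch below pairs a spoken command with the object that was fixated for the longest time in a window around the utterance. The 1500 millisecond look-back is taken from the range reported by Prasov, Chai and Jeong (2007); the `Fixation` structure, the `resolve_target` function and the longest-overlap rule are assumptions made purely for illustration and do not describe the implementation used in this study.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Fixation:
    target: str    # hypothetical identifier of the on-screen object
    start_ms: int  # fixation onset
    end_ms: int    # fixation offset


def resolve_target(fixations: List[Fixation], speech_start_ms: int,
                   speech_end_ms: int, lookback_ms: int = 1500) -> Optional[str]:
    """Return the target fixated longest in the window around an utterance.

    The window runs from `lookback_ms` before speech onset to speech offset.
    Returns None when no fixation overlaps the window.
    """
    window_start = speech_start_ms - lookback_ms
    overlap_by_target: Dict[str, int] = {}
    for fix in fixations:
        # Overlap between this fixation and the window, in milliseconds.
        overlap = min(fix.end_ms, speech_end_ms) - max(fix.start_ms, window_start)
        if overlap > 0:
            overlap_by_target[fix.target] = overlap_by_target.get(fix.target, 0) + overlap
    if not overlap_by_target:
        return None
    return max(overlap_by_target, key=overlap_by_target.get)
```

A stricter variant could set `lookback_ms` to zero, so that only gaze held during the utterance itself is considered, which corresponds to the expectation that the user keeps looking at the object while speaking.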
 
2.7 Speech recognition 
Automatic speech recognition (ASR) is the process whereby human speech is interpreted in a computer 
(Forsberg, 2003) through the process of mapping the acoustic signals generated by the human vocal system to 
words (Jurafsky, 2000). Speech that is captured is first digitised, confirmed against a dictionary and then 
converted and, if required, displayed as typed text (Freedman, 1998). 
The first foray into speech recognition produced a toy dog dubbed Radio Rex in the 1920s which recognised its 
name and emerged from its doghouse when called (Russel & Norvig, 2009). However, the first electronic 
speech synthesiser was only developed in 1936 by AT&T Bell Labs (Russel & Norvig, 2009). In the early 1970s 
the Defense Advanced Research Projects Agency (DARPA) took an interest in speech recognition and funded 
four projects to develop high performance speech recognition systems (Russel & Norvig, 2009). It was however 
only in the 1980s that speech recognition became commercially available (Dragon Naturally Speaking, nd). 
With the first release of speech recognition engines, the technology was expensive and therefore not suitable 
for the mass market. Since then, advances in technology as well as in the fields of digital signal processing, 
pattern matching and classification algorithms have made speech recognition commercially attainable even for 
personal computer environments (Karl, Pettey & Shneiderman, 1993). This has allowed speech recognition to 
become widely available to the extent that it is now a standard feature of current computers. Since the 
technology is now readily available to the general user population, ways in which it can effectively be utilised 
must be investigated. 
 
2.7.1 How speech recognition works 
Vocal communication between humans and computers can either be in the form of text-to-speech (TTS) 
synthesis or ASR, also known as speech-to-text conversion. Algorithms designed for synthesis have been more 
successful than those designed for recognition, due to the complexity of interpreting speech (O’Shaughnessy, 
1995).  
Essentially, ASR is a pattern recognition task which requires that the received speech signal be matched to 
corresponding text (O’Shaughnessy, 1995). Pattern recognition tasks generally have a training phase which is 
followed by a recognition phase (O’Shaughnessy, 1995). The training involves the creation of a reference 
memory or a dictionary of speech patterns (O’Shaughnessy, 1995). Recognition involves a number of steps, 
namely (1) normalisation, (2) parameterisation, (3) feature extraction, (4) similarity comparison and (5) a 
decision. Normalisation involves the removal of variability in the input signal as a consequence of the 
environment, after which the signal is divided into parameters and features. Parameters constitute the 
outputs from standard speech analysis while features are the outputs of further analysis (O’Shaughnessy, 
1995). Recognition is then attempted by comparison of the input signal with reference templates obtained 
during training (O’Shaughnessy, 1995). A decision as to what text must be output is then made, based on the 
template with the closest match to the received signal (O’Shaughnessy, 1995). However, if the match is too 
poor the decision must be postponed (O’Shaughnessy, 1995). 
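
The training and recognition phases described above can be sketched as a toy template matcher. Everything in the sketch is a simplifying assumption: the normalisation (mean removal and peak scaling), the features (average energy per segment), the Euclidean distance and the `reject_above` threshold are illustrative stand-ins and are far simpler than the analysis performed by a real speech engine.

```python
import math
from typing import Dict, List, Optional, Sequence


def normalise(signal: Sequence[float]) -> List[float]:
    """Toy normalisation: remove the mean and scale to unit peak amplitude."""
    mean = sum(signal) / len(signal)
    centred = [s - mean for s in signal]
    peak = max(abs(c) for c in centred) or 1.0  # avoid division by zero
    return [c / peak for c in centred]


def extract_features(signal: Sequence[float], frames: int = 4) -> List[float]:
    """Toy features: average energy in each of `frames` consecutive segments."""
    size = max(1, len(signal) // frames)
    return [sum(x * x for x in signal[i * size:(i + 1) * size]) / size
            for i in range(frames)]


def train(examples: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    """Training phase: build the reference memory of one template per word."""
    return {word: extract_features(normalise(sig)) for word, sig in examples.items()}


def recognise(signal: Sequence[float], templates: Dict[str, List[float]],
              reject_above: float = 0.2) -> Optional[str]:
    """Recognition phase: return the closest template, or None to postpone."""
    features = extract_features(normalise(signal))
    best_word, best_dist = None, math.inf
    for word, template in templates.items():
        dist = math.sqrt(sum((f - t) ** 2 for f, t in zip(features, template)))
        if dist < best_dist:
            best_word, best_dist = word, dist
    # If even the best match is too poor, no decision is made.
    return best_word if best_dist <= reject_above else None
```

Returning `None` models the postponed decision: the caller could, for example, ask the speaker to repeat the word.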
In order to successfully recognise boundaries between words, some technologies require speakers to pause 
briefly between word utterances (O’Shaughnessy, 1995), a situation which hardly supports natural 
communication between human and computer. These types of isolated word recognition engines can only 
process speech at a rate of 20 to 100 words per minute (O’Shaughnessy, 1995). However, connected systems 
allow sequences of concatenated words while continuous recognition affords the speaker the naturalness of 
speaking fluently (Nusbaum, De Groot & Lee, 1995) and are typically capable of processing between 150 and 
250 words per minute. This does, however, increase the intricacy of recognition substantially (O’Shaughnessy, 
1995).  
The grammar which will be used for the current study will consist of commands comprising only a few words at 
a time. Therefore, any one of the afore-mentioned speech engines would be suitable. However, since the 
interface has to exhibit potential for a multitude of interaction techniques, a continuous speech recognition 
engine will be used to provide for more natural speech interaction in the future. All participants in the study 
will be expected to complete training to create a speech profile for themselves. This will increase the 
recognition accuracy and improve system reaction to the grammar. 
 
2.7.2 Functions of speech recognition 
Speech recognition can fulfil two types of functions, namely dictation and command and control (Pireddu, 
2007). Dictation is primarily used to transcribe documents into a digital form (Pireddu, 2007). Command 
recognition requires a spoken word to be verified against a grammar and then the system responds to that 
command. Therefore, speech recognition and generation is ideal for use in situations where the environment 
is hands-busy, eyes-busy, mobility-required or hostile and is very promising for use in telephonic services 
(Shneiderman, 2000).  
For the purposes of the study, the naturalness and theoretical high performance levels which can be achieved 
with continuous speech recognition are ideal. Within the environment of a word processor, dictation may be of 
the utmost importance as the use of dictation will allow the user to speak aloud and have the spoken text 
automatically transcribed to the current document. Memorised text can be spoken 5 times faster than it can 
be written. However, since the composition of the desired text consumes the majority of the time taken, 
dictation may only increase a writer’s speed by 20 to 65 percent (Schmandt, Ackerman & Hindus, 1990). Even 
so, this may be a significant improvement over typing speeds. Although a fully functional dictation engine will 
be provided, the testing and comparison thereof to standard text input is beyond the scope of the study as the 
focus of the study is on determining the usability of different means of text input. Therefore, participants will 
be expected to input text using an onscreen keyboard one letter at a time. The question which is now posed is 
whether this will result in increased typing speeds or whether the requirement of concentrating on the 
onscreen keyboard will affect the composition skills of the participant. Theoretically, the onscreen typing will 
be similar to typing using a keyboard but the effort required to use the eyes as a control device may cause 
additional strain and negatively affect the compositional speed of the user. Free-writing text will not be tested 
in this study. 
The second way in which speech will be incorporated is through the compilation of a grammar containing 
common word processor commands. When these commands are issued and recognised by the speech engine, 
their word processing counterparts will be executed. This will alleviate the need to use either the mouse or 
keyboard to edit or format word processing documents. In addition, the speech grammar will also allow text 
input via an onscreen keyboard when used in conjunction with eye gaze to establish user attention. 
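
A command grammar of this kind can be thought of as a lookup from recognised phrases to word processor actions. The sketch below is a minimal illustration under that assumption; the `Document` class, its methods and the three phrases are hypothetical placeholders and do not reflect the grammar or the word processor interface used in the study.

```python
from typing import Callable, Dict, List


class Document:
    """Hypothetical stand-in for a word processor document."""

    def __init__(self) -> None:
        self.actions: List[str] = []  # record of executed actions

    def toggle_bold(self) -> None:
        self.actions.append("bold")

    def toggle_italic(self) -> None:
        self.actions.append("italic")

    def page_down(self) -> None:
        self.actions.append("page down")


def build_grammar(doc: Document) -> Dict[str, Callable[[], None]]:
    """Map each spoken phrase to its word processing counterpart."""
    return {
        "bold": doc.toggle_bold,
        "italic": doc.toggle_italic,
        "page down": doc.page_down,
    }


def handle_utterance(grammar: Dict[str, Callable[[], None]], utterance: str) -> bool:
    """Execute the command if the phrase is in the grammar; report success."""
    phrase = " ".join(utterance.lower().split())  # normalise case and spacing
    action = grammar.get(phrase)
    if action is None:
        return False  # phrase falls outside the grammar and is ignored
    action()
    return True
```

Phrases outside the grammar are simply rejected, which is the behaviour that makes a small, fixed command vocabulary easier to recognise than free dictation.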
2.7.3 Considerations and factors influencing speech recognition 
Some tasks simply interfere with one another as they draw on the same cognitive resources and as such 
cannot be performed to their full potential when executed in parallel (Suhm, 2008). For example, most 
humans can type and think simultaneously but find it much harder to speak and think at the same time 
(Shneiderman, 2000). Hand-eye coordination is accomplished in a different part of the brain which allows it to 
be performed in parallel with problem solving (Shneiderman, 2000). It was found that when issuing commands 
such as “bold” and “page up” there was a marked increase in the speed with which tasks could be 
accomplished. However, when coupled with a memorisation task followed by a “page down” command users 
found it difficult to complete the task and had to repeatedly scroll back to the symbols that had to be 
memorised (Shneiderman, 2000).  
Moreover, users voiced concern over the commands that had to be memorised to complete tasks in a word 
processor environment (Karl, Pettey & Shneiderman, 1993). Memorisation and the difficulty the user 
experiences in remaining within the boundaries of the stipulated sublanguage belonging to the domain are important 
factors when designing a speech recognition system (Forlines, Schmidt-Nielsen, Raj, Wittenburg & Wolf, 2005). 
However, in this instance it is somewhat surprising that participants were concerned over the amount of 
memorisation required for command execution. Within the confines of the word processor environment there 
is a grammar somewhat unique to its environment and learning this terminology is part of the learning curve 
of a word processor. Therefore, the use of these within a spoken grammar should not require additional 
memorisation on the part of the user and they should comfortably be able to stay within the confines of the 
provided grammar as it would closely resemble the names and descriptions already used within the 
application. The grammar used in the study by Karl et al. (1993) consisted of only 18 words, the majority of 
which were well-known word processing terms which the users should have been familiar with. It is envisaged 
that it will only be the less commonly used tasks which will require some effort to retrieve from memory but 
since this is the case even when the icons or menu are used it can hardly be considered a drawback of speech 
recognition.  
Nevertheless, in order to alleviate the memorisation required by the user, an application can also be tasked 
with the responsibility of learning a vocabulary which gives the user the freedom to provide speech commands 
that are understandable and intuitive to them. For example, after the Java-based jfig drawing application was 
adapted by Gorniak and Roy (2003) to include speech, users could train the application to associate their 
chosen speech commands to actions in the application. Additionally, as a consideration to ensure that the user 
stays within the confines of the permitted sub-language, it is possible to design a context sensitive menu, in 
the same vein as the Things To Say (TTSay) menu used in a dialogue system for thermostat control 
(Freudenthal, Keyson, DeKoven & De Hoogh, 2001). When the system determines that specific commands are 
applicable to the current context in which the system finds itself, a menu displaying these acceptable 
commands can be shown to the user to restrict them to a limited number of commands. In such a way it is 
analogous to a context-sensitive menu or enabling of allowable menu items. Furthermore, it is comparable to 
the pop-up context toolbar and the context-sensitive ribbon (replacement for the menu and standard toolbar) 
that has been in use from Word 2007 onwards. Regardless of the strain that memorisation of commands 
places on the user, using speech for non-dictation purposes is faster than using the mouse and keyboard as 
confirmed by Cohen, McGee and Clow (2000) although dictation was found to be slower than text input using 
the keyboard (Sears et al., 2001). 
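
The idea of restricting the user to the commands that apply in the current context can be sketched as follows. The contexts, the command lists and the function names are invented for illustration; they are not taken from the TTSay menu or from Word.

```python
from typing import Dict, List

# Hypothetical contexts and command lists, invented for illustration only.
COMMANDS_BY_CONTEXT: Dict[str, List[str]] = {
    "text_selected": ["bold", "italic", "underline", "copy", "cut"],
    "nothing_selected": ["paste", "select all", "page up", "page down"],
}


def things_to_say(context: str) -> List[str]:
    """Return the commands that would be displayed for the given context."""
    return list(COMMANDS_BY_CONTEXT.get(context, []))


def is_permitted(context: str, utterance: str) -> bool:
    """Accept an utterance only if it appears on the menu currently shown."""
    phrase = " ".join(utterance.lower().split())
    return phrase in things_to_say(context)
```

Displaying the result of `things_to_say` alongside the document would serve the same purpose as greying out menu items that are not applicable.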
The underlying fact that all participants in the current study will be experts in word processing was the 
influencing factor in deciding not to place the burden of learning the grammar on the application. The 
commands included in the grammar will be tailored according to the terminology of a word processor and 
should not be beyond the abilities of the user. Therefore, it is theorised that the bulk of the grammar will 
quickly be adopted by the users and those isolated incidents where there is no verbal counterpart for the 
command, such as with the keyboard navigation, will easily be learnt by the users. Furthermore, providing a 
grammar with visual examples of the effect the command will have, may actually serve to ease the learning 
curve of novice and first-time users, particularly for users attempting to teach themselves the intricacies of 
word processing. Participants in the study will be closely monitored to determine the difficulty experienced in 
memorising the command list provided. 
An additional consideration of speech recognition is the switching between voice commands and typing, which 
can be “disruptive” (Morrison, Green, Shaw & Payne, 1984 as cited in Karl et al., 1993) but unfortunately, all 
systems which use ASR in some way face the challenge of correctly distinguishing between dictation and 
commands. A rocker switch on the microphone has been used to toggle between dictation and command 
issuing (Oviatt et al., 2000) but this eliminates the possibility of a truly hands-free environment. For the 
purposes of the current study toggling between command and dictation is irrelevant since dictation will not be 
tested, although solutions should still be considered. Since eye gaze will be included, dwell time (section 
2.8.3.1) can be used to toggle between command and dictation mode by simply glancing at the ribbon. This 
may however be considered just as disruptive as the other mechanisms for toggling between the states. 
Alternatively, a voice command can be used to toggle between dictation and commands which would logically 
seem to be the fastest method of switching states. 
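
Toggling by voice command amounts to a small two-state machine, which the following sketch illustrates. The toggle phrase "switch mode" and the `SpeechInput` class are assumptions made for the example; a practical system would also have to handle the case where the user wishes to dictate the toggle phrase itself.

```python
from typing import List


class SpeechInput:
    """Routes utterances to command handling or dictation by current mode."""

    def __init__(self, toggle_phrase: str = "switch mode") -> None:
        self.mode = "command"              # start in command mode
        self.toggle_phrase = toggle_phrase
        self.commands: List[str] = []      # utterances treated as commands
        self.dictated: List[str] = []      # utterances treated as text

    def hear(self, utterance: str) -> None:
        phrase = " ".join(utterance.lower().split())
        if phrase == self.toggle_phrase:
            # The reserved phrase flips the mode and is not passed on.
            self.mode = "dictation" if self.mode == "command" else "command"
        elif self.mode == "dictation":
            self.dictated.append(utterance)
        else:
            self.commands.append(phrase)
```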
Other factors which influence the performance of speech recognition in a given situation include fatigue, effort 
and stress (Nusbaum, De Groot & Lee, 1995). The human ability for resilience and flexibility can, however, 
counteract these effects and humans can be taught to use a speech recognition system effectively through 
learning to overcome these shortcomings (Nusbaum et al., 1995). Users of speech recognition must, however, 
be vigilant that they do not resort to “hyperarticulation”, which occurs when a speaker attempts to speak 
more clearly in the event that the system repeatedly fails to recognise the word (Oviatt et al., 1998). The 
recognition system is also heavily dependent on the environment in which it is used as this environment is 
subject to ambient noise, as well as possible conversations that are not directed at the recognition system but 
which may be interpreted by the system as being so (Suhm, 2008). Consequently, the system should ideally be 
trained under the same conditions under which it will eventually be used (Nusbaum et al., 1995). 
For this reason, training of the speech profiles will take place in the same venue, ideally under the same 
conditions under which the tests will be conducted. Participants will be observed during their test sessions to 
establish whether they learn to adapt and master the use of speech recognition. 
 
2.7.4 Speech-enhanced user interfaces  
“How do people want to talk to their computers – and do they want to talk to them at all?” (Berg, 
Gröber & Weicht, 2010, p. 19) 
The proliferation of computers in everyday life has ensured that most people use a computer on a daily basis 
and for this they depend heavily on the standard keyboard for text input (Feng & Sears, 2004). Speech 
technology is an exciting concept which provides an ideal input device for significantly reducing the large 
amount of typing that must be performed. Although typing is an integral part of computer use, it requires high 
levels of practice and many users are not able to achieve high typing speeds even with prolonged use of a 
keyboard (Feng & Sears, 2004). However, speaking is an innate ability and most individuals are capable of a 
high average rate of spoken words per minute. The high incidence of afflictions such as tendinitis, carpal 
tunnel syndrome and repetitive strain injuries also provides ample motivation to reduce typing requirements 
and device manipulation (Klarlund, 2003). The possibility of high-speed recognition when using 
continuous speech recognition likewise motivates the incorporation of ASR as a standard input method in 
typing-intensive environments, such as a word processor or text editor. Taking into account that an 
experienced typist can reach average speeds of 68 words per minute (Logan & Crump, 2009; Liu, Crump & 
Logan, 2010), ASR provides a means to considerably increase efficiency.  
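
The size of the potential gain can be illustrated with simple arithmetic, comparing the typing speed quoted above with the 150 to 250 words per minute that continuous recognition engines are said to process (section 2.7.1). The calculation is only a theoretical upper bound: it assumes error-free recognition and ignores the time spent composing and correcting text.

```python
TYPING_WPM = 68              # experienced typist, as quoted above
ASR_WPM_RANGE = (150, 250)   # continuous recognition, from section 2.7.1


def speedup(asr_wpm: float, typing_wpm: float = TYPING_WPM) -> float:
    """Ratio of recognition throughput to typing throughput."""
    return asr_wpm / typing_wpm


low, high = (speedup(wpm) for wpm in ASR_WPM_RANGE)
print(f"Theoretical speed-up: {low:.1f}x to {high:.1f}x")
# prints "Theoretical speed-up: 2.2x to 3.7x"
```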
However, a limitation of speech recognition is the fact that it may have to be used in an open-plan 
environment or public forum which could infringe on the privacy of the user (Suhm, 2008). Furthermore, a 
user may feel that speaking to a computer is unnatural and may find it embarrassing talking to a machine, 
although they may quickly become accustomed to it. This sentiment was evident in one user’s response to 
using speech in a study conducted by Nelson (1986) - “at first it was kind of strange and almost like you were 
sitting there talking to yourself, but once we got used to it and I started working with it full time, it was a lot 
faster”. 
A Wizard of Oz (WOZ) experiment entails a simulation of the intended environment through having a 
facilitator respond to user commands as though the environment was doing so. Such an experiment was 
conducted in order to determine whether these limitations outweigh the advantages, whether users were 
prepared to use speech as an input technique and in what context they would use it (Berg, Gröber & Weicht, 
2010). Results indicated that users had a tendency to make use of the more familiar GUI interaction, except 
when faced with a complex task. Under these circumstances, users switched to verbal communication. User 
reaction was positive to the use of speech recognition and many users indicated that they would like to use it 
in future even though it may be unnatural or embarrassing. Thus, the heavy reliance on the GUI was seen as a 
direct result of familiarity with such interaction.  
Results on the type of commands which were issued were inconclusive, with some users resorting to issuing 
commands based on menu wording and others using more task-oriented commands by translating the task 
they wished to complete into complex instructions (Berg et al., 2010). Therefore, since it is unclear whether 
grammars should be structured in a menu-oriented or task-oriented manner, the current study will resort 
to a menu-oriented grammar in an attempt to lessen the learning curve and memorisation required. 
The use of speech recognition as an input method in popular graphics editors was found to be feasible by 
Yankelovich (2008). However, this implementation did not incorporate actual speech recognition but used a 
WOZ experiment where a facilitator took over control of the mouse. This setup should minimise errors since 
the facilitator is aware of the intentions of the user and the actions the user is currently busy with. 
Furthermore, interpretation is easier since the “speech engine” has human interpretive capabilities and does 
not rely on a fixed grammar or require training in order to be able to understand the user. Therefore, while 
showing that such an interface is feasible, it does not conclusively prove that speech recognition will be 
successful when incorporated into such an environment as the limitations associated with such technology 
were not present with the human facilitator. 
Speech recognition has also been successfully used in a digital music retrieval system (Forlines et al., 2005) and 
even in the specialised domain of computer programming, which is not suitable to speech recognition in its 
natural state. VoiceCode allows code to be dictated in a straightforward manner, including the dictation of 
variable names which would not normally be recognised by a standard ASR engine (Désiltes, Fox & Norton, 
2006).  
Early speech recognition studies yielded promising results and highlighted the advantages of using speech 
recognition. A 96.8% accuracy rate and 17.5% reduction in completion time was achieved for speech 
recognition commands in a simulated military command and control application (Poock, 1982). Speech 
recognition significantly decreased the error rate in an airline baggage handling system (Nye, 1982) and in a 
language-directed program editor (Leggett & Williams, 1984). Conversely, studies have shown that dictation 
does not achieve the same rate of words per minute as when speaking in a natural forum, mainly due to the 
high incidence of recognition errors, and the difficulties experienced in correcting these via voice commands, 
although dictation speed does increase with the experience level of the user (Feng & Sears, 2004).  
The positive results achieved in some studies urge exploration into using speech recognition in everyday 
applications, particularly as a means to increase efficiency and learnability of these applications. If the use 
thereof is successful and also serves to limit the physical consequences of extended computer use, speech 
recognition could meet the expectations of future generations of user interfaces. Therefore, the current study 
will investigate the possibility of using such a technology within a word processor. 
 
2.7.5 Speech-enhanced word processing 
Although ASR has been used in applications such as form filling and personal digital assistant (PDA) 
applications, the situation is much more complex in an editing environment as there is no natural language to 
express editing commands (Klarlund, 2003). It has therefore been suggested that natural language is 
unsuitable and inefficient for use in an editing application (Klarlund, 2003). For example, in order to insert 
three exclamation marks one would have to issue commands such as “insert three exclamation marks” or 
“exclamation mark, exclamation mark, exclamation mark”; this requires far more time and effort than simply 
typing the three exclamation marks in successive key strokes (Klarlund, 2003). However, by making use of 
symbolisation, it may be possible to lighten the load on the user and make natural language commands more 
intuitive and more efficient than keyboard input (Klarlund, 2003). To test this premise, ShortTalk uses 
symbolised editing concepts that can be concatenated into phrases. For example, ShortTalk uses such 
commands as “Goop” for “go up”, “Loon” for “New line” and “Go aift hello” for “place the cursor to the left of 
the occurrence of hello”. A counting system is also provided through the commands “Ain, Twain, Traio, Fairn, 
...”. The belief is that learning this symbolisation language will be easier than learning an editing language, as 
evidenced by the human ability to string letters into words and words into sentences (Klarlund, 2003). The 
aforementioned symbolisation system seems to require considerable memorisation and an unnatural way of 
expressing oneself, and it seems doubtful that many people would go to the lengths required to learn it. 
Therefore, such a method will not be considered for the current study. 
In a study conducted by Karl et al. (1993), four word processing tasks were identified that were deemed 
suitable for voice commands. Sixteen users, fifteen of whom were new to speech recognition, performed four 
simple word processing tasks using both speech recognition to issue commands and traditional direct 
manipulation. The first task could be completed using voice commands or the mouse – no typing was required. 
Users were given an unformatted document and were required to reformat it using the predefined styles “bullet”, 
“figure”, “figure-label”, “text”, and two section header styles “level one” and “level two”. The mouse was used 
to select the appropriate text and then the user either had to issue a verbal command or use the mouse to 
navigate to the correct menu item depending on which input group they were part of. The second task 
required the users to type a formula containing subscripted and superscripted letters, bold text and Greek 
letters. The ratio of keystrokes to uttered commands was measured at 1.65 to 1. The third task was to create a 
table of symbols using only copy, paste, up and down commands. The ratio of minimum required keystrokes to 
minimum required voice commands was 5.57 to 1. The fourth task required subjects to type a short paragraph 
which contained such word processing elements as bold and italicised text, subscripted and superscripted 
letters. It was compulsory to type the paragraph from left to right and activate the commands as they were 
needed. The keystroke to command ratio was 12.4 to 1. Performance time was reduced by 18.6% when using 
speech recognition and was significantly faster than direct manipulation. While error rates remained the same 
for both input groups, significantly more memorisation errors were made when using speech recognition. 
Users were enthusiastic about using speech recognition for command activation but expressed concerns about 
recognition accuracy, background noise, inadequate feedback and slow response time. Negative 
comments about the speech technology included the low reliability of command recognition as well as the 
possibility of inadvertently inserting unwanted text into your document when engaging in conversation 
separate from your task at hand. These findings would seem to support the call for a function to switch the 
speech recognition off in a quick and easy manner.  
The lack of feedback also caused some concern amongst the participants (Karl et al., 1993).  The commands 
used in the study of Karl et al. (1993) largely corresponded with word processing terminology. Clearly, the 
commands are more efficient than direct manipulation in terms of the number of actions and completion time 
of tasks. A drawback of this study was that navigation was not accommodated through speech recognition and 
for those purposes the mouse still had to be used. The fact that this allowed text to be selected and a voice 
command to be immediately issued could be the reason for the faster completion time as use of the mouse 
only would require the users to move the mouse from the selection to the required icon or menu item. It 
would have been more beneficial to require text selection through the use of speech recognition as well and 
then compare the time required to complete the task. As reasoned in a prior section and in keeping with the 
study of Karl et al. (1993), the current study will also compile a grammar which resembles standard word 
processing terminology with which the participants should be familiar. Cursor control and navigation will also 
be provided for in the grammar so that a completely hands-free environment can be created and the usability 
thereof tested. The means through which cursor control can be used will be discussed in more detail in the 
following section. However, the current study will not allow participants to mix the modalities so that the test 
conditions can be controlled for and comparisons between the interaction techniques can be made. An 
additional drawback of this study was that participants were tested only once; therefore, the learning curve of 
the speech commands could not be established. The current study will compare measures across a period of 
time, thereby allowing the learning curve of the proposed system to be determined. 
Children using word processors perform text entry faster when using speech recognition as opposed to 
keyboard or mouse entry or handwriting recognition (Read, MacFarlane & Casey, 2001). However, text entry 
via speech recognition also has the highest error rate of these entry methods (Read et al., 2001). The 
implication of this study is that, for children, the learning curve for text entry via speech recognition is not as 
steep as that of the traditional methods. Since it can be assumed that children do not have the keyboard 
proficiency of more mature users, the younger users were arguably still learning to type. The fact that they 
could achieve faster speeds with speech recognition suggests that this interaction technique can be adopted 
more quickly than fast keyboard typing can be learnt. 
The higher error rates could decrease over time as the language skills of the children improve and 
the speech engine becomes more accustomed to their individual profiles. In that case it is imperative that a 
satisfactory method of error correction be found to negate the consequences of the errors.  
In conclusion, in terms of word processing, speech recognition appears to be a viable option for both younger 
and more mature users although some shortcomings (i.e. navigation and memorisation) do need to be 
overcome. Solutions to these obstacles have been proposed and the current study will attempt to test 
empirically whether these solutions are viable. 
 
2.7.6 Using speech recognition to control the cursor 
One of the most fundamental usability problems associated with ASR in text editing is the need to correct, 
manipulate and format text after it has been dictated (Vergo, 1998). This is an important consideration since (i) 
people seldom dictate grammatically correct, well-organised text, (ii) ASR is not 100% accurate so there are 
errors in the dictated text and (iii) most users prefer to edit text after they have dictated as opposed to while 
they are dictating it (Oviatt et al., 2000). The previous discussion has shown that using speech recognition can 
reduce task completion times. However, if the user is tasked with positioning the cursor and correcting text 
errors, then task completion with speech recognition is slower and less accurate than with keyboard input 
(Haller, Mutschler & Voss, 1984). Furthermore, it was found that error correction using speech recognition is 
somewhat problematic, particularly for novice users who often get caught in a web of corrections, whereas 
more experienced users revert back to keyboard use to correct errors (Shneiderman, 2000).  
For the purpose of correcting text, cursor control is often needed. The main types of cursor control are defined 
as being either target-based or direction-based (Sears, Lin & Karimullah, 2002). Target-based cursor control 
receives and reacts to commands which explicitly identify a target, such as a word that must be selected (Sears 
et al., 2002). Conversely, direction-based cursor control involves directing the movement of the cursor in a 
certain direction, for example, “move the cursor three words left” (Sears et al., 2002).   
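The distinction between the two types of cursor control can be illustrated with a minimal Python sketch. The 
command phrasings, the function name `apply_command` and the word-level cursor model are assumptions made 
purely for illustration; they are not taken from Sears et al. (2002). 

```python
import re

# Hypothetical spoken number words accepted by the direction-based command.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def apply_command(words, cursor, command):
    """Return the new cursor index (a position in `words`) after a spoken command.

    Assumed grammar (illustrative only):
      target-based:    "select <word>"
      direction-based: "move <n> word(s) left|right"
    Unrecognised commands leave the cursor where it is.
    """
    command = command.lower().strip()

    # Target-based: the command names the target explicitly.
    match = re.fullmatch(r"select (\w+)", command)
    if match:
        lowered = [w.lower() for w in words]
        target = match.group(1)
        return lowered.index(target) if target in lowered else cursor

    # Direction-based: the command gives a direction and a distance.
    match = re.fullmatch(r"move (\w+) words? (left|right)", command)
    if match:
        count = NUMBER_WORDS.get(match.group(1))
        if count is None:
            return cursor
        step = -count if match.group(2) == "left" else count
        # Clamp so the cursor never leaves the text.
        return max(0, min(len(words) - 1, cursor + step))

    return cursor
```

With the text "the quick brown fox jumps" and the cursor on the last word, "move three words left" places the 
cursor on "quick", whereas "select fox" jumps directly to the named word irrespective of the starting position. 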
An associated problem with continuous cursor movement is that the cursor may overshoot the target since the 
application requires time to react to a command. Multiple cursors can be used to overcome this – one to 
indicate the current cursor position and one to indicate the position where the cursor will stop (Karimullah & 
Sears, 2002). Continuous cursor movement will also probably be slower than manual manipulation of the 
mouse as the cursor will have to move slowly enough to be tracked by the user. Therefore, it is not an ideal 
solution. 
Alternatively, a grid-based cursor control system has been proven to work effectively (Dai, Goldman, Sears & 
Lozier, 2004). In this scenario, the screen is divided into a grid with each grid square being allocated a number. 
Should the user wish to place the cursor over a target, they simply vocalise the number of the square in which 
the target is situated. The grid gets progressively smaller with each vocalisation until the target can be 
acquired. The grid can also be moved by issuing vocal commands. The grid-based solution was 33% faster 
than other cursor control systems, and error rates were 70% lower with large targets and 85% lower 
with smaller targets. Therefore, it would seem as though the grid offers a potentially accurate method of 
controlling cursor movement with voice commands.  
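The progressive refinement described above can be sketched as follows. The 3 x 3 grid, the row-by-row cell 
numbering and the function names are assumptions chosen for illustration and are not necessarily those used 
by Dai et al. (2004). 

```python
def refine(region, cell, rows=3, cols=3):
    """Shrink `region` (x, y, width, height) to the numbered cell that was spoken.

    Cells are assumed to be numbered 1..rows*cols, left to right, top to bottom.
    """
    if not 1 <= cell <= rows * cols:
        raise ValueError("cell number outside the grid")
    x, y, width, height = region
    row, col = divmod(cell - 1, cols)
    cell_w, cell_h = width / cols, height / rows
    return (x + col * cell_w, y + row * cell_h, cell_w, cell_h)


def locate(screen, spoken_cells, rows=3, cols=3):
    """Apply each spoken cell number in turn and return the centre of the final cell."""
    region = screen
    for cell in spoken_cells:
        region = refine(region, cell, rows, cols)
    x, y, width, height = region
    return (x + width / 2, y + height / 2)
```

For a 900 x 900 pixel area, saying "one" followed by "nine" narrows the grid to a 100 x 100 pixel cell and 
would place the cursor at its centre; each further vocalisation divides the side of the cell by three. 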
However, the presence of the grid on an already full screen will likely hamper the performance of the user as it 
will infringe on the relevant information which must be displayed, particularly in terms of a word processor 
which already has a full task bar. Furthermore, the editing area should, as far as possible, be left free of 
irrelevant clutter to allow users to perform at maximum potential. The use of a grid remains a possibility if the 
visibility thereof can be switched on and off as needed. Perhaps when the first cursor control command is 
issued, the grid can be overlaid on the working area and when the navigation and editing is complete the grid 
can be hidden from view automatically. Even so, the fact that the grid may have to shrink to the size of a 
character from the relatively large area of the original document implies that a substantial amount of effort 
may be required to position the cursor correctly. Therefore, the grid-based navigation as proposed by Dai et al. 
(2004) will still have to be coupled with some other mechanism.  
Another possibility for combating the problem of cursor movement is the use of a multimodal 
interface, as is the case with the Human-Centric Word Processing (HCWP) system. This system combines 
ASR and natural language understanding with pen-based pointing and selection gestures which allow the user 
to issue editing commands which are sensitive to location-based utterances (Oviatt et al., 2000). These 
locations (for example, “this”, “here” and “there”) are then interpreted using the position and nature of the 
pen-based gesture. Eye gaze would also be an ideal solution to this problem as the user could simply look at 
the required location when issuing the command and the system could interpret it accordingly. The added 
advantage of using eye gaze and not gestures is, once again, the provision of a completely hands-free 
environment.  
The simultaneous use of eye gaze or gestures and speech seems to be the most promising of the cursor control 
methods discussed here. Since some studies (cf. Bolt, 1981; Oviatt et al., 2000) have already been conducted 
which investigate this type of communication and shortcomings have been identified for the other discussed 
methods, the current study will try a new approach which was not encountered in the literature. While 
standard commands to move the cursor (e.g. left, right, up and down) will be provided, more complex 
selection commands will also be catered for. These commands will be analogous to moving the cursor 
efficiently using the keyboard only. Since efficient and effective cursor movement can be achieved without 
having to remove one’s hands from the keyboard, providing equivalent speech commands could achieve the 
same results. This study will determine whether these will be sufficient to facilitate efficient cursor movements 
or whether they are dependent on the knowledge of the user with regard to navigation using a keyboard.  
 
2.8 Eye-tracking 
2.8.1 Hardware 
The device used to measure eye movements is known as an eye-tracker (Duchowski, 2007) where eye-tracking 
is the measurement of “the spatial direction (gaze and eye fixation) of where the eyes are pointing” (Dvorak, 
2007, p. 283). There are two ways in which eye movement can be monitored. Firstly, the position of the eye 
can be determined relative to the head and secondly, the point of regard, which is the orientation of the eye in 
space, can be determined (Duchowski, 2007). 
The study of eye gaze predates computing technology by many decades, dating back as far as 1878 (Jacob & 
Karn, 2003). Consequently, eye-tracking research can be compartmentalised into distinct eras, the first of 
which is widely accepted to have begun in 1879 when Javal recorded observations concerning the function of 
eye movements in reading (Rayner, 1998). This era extended until 1920 and concentrated on the discovery of 
basic facts concerning eye movement such as saccadic suppression and latency, and perceptual span (Rayner, 
1998). The second era spanned a period of applied focus, coinciding with the behaviourist movement in 
experimental psychology (Rayner, 1998). From the late 1950s to the mid-70s very little research was done 
concerning eye movements. A resurgence of interest in the mid-70s resulted in improvement in the eye 
movement recording systems with the result that the measurements are easier to obtain and far more 
accurate (Rayner, 1998). The development of eye-tracking hardware can be subdivided into four distinct 
groups (Duchowski, 2007), namely:  
• Scleral contact lens/search coil and electro-oculography 
• Those that use video- and photo-oculography 
• Eye-trackers which are analogue video-based combined with pupil/corneal reflection 
• Those using digital video-based combined pupil/corneal reflection which can be augmented by 
computer vision techniques and digital signal processors  
The use of electro-oculography requires that electrodes be placed around the eye which allow the electric 
potential differences of the skin to be measured (Duchowski, 2007). Scleral contact lenses are a very precise 
eye-tracking method which necessitates that a contact lens be placed directly on the eye (Duchowski, 2007). 
The second category of eye-trackers measures distinguishable features of the eye under rotation/translation, 
while video-based eye-trackers are capable of measuring the point of regard (Duchowski, 2007). This can be 
done in one of two ways: either the head must remain stationary so that the point of regard and the eye’s 
position relative to the head are identical, or head movement must be disambiguated from eye rotation by 
measuring a number of ocular features, for example corneal reflection and the pupil centre (Duchowski, 2007).  
When using corneal reflection, light sources, for example infrared, are shone into an individual’s eyes. The 
technology behind the infrared eye-trackers is based on the fact that an infrared LED that is shone on the 
human eye causes a reflection spot that remains static (Figure 2.3) regardless of the direction the eye is 
looking (Drewes & Schmidt, 2007). Specifically, light falling on the curved cornea is reflected back, creating the 
four Purkinje images, the first of which is tracked by video-based eye-trackers (Duchowski, 2007).  
The fourth category of eye-trackers is based on the same principles as video-based corneal reflection 
eye-trackers, but uses digital optics instead of analogue video (Duchowski, 2007). The use of digital signal 
processors significantly increases the accuracy and the usability of the eye-trackers whilst simultaneously 
causing a decrease in the cost (Duchowski, Cournia & Murphy, 2004).  
 
 
Figure 2.3: Video-based eye-tracking using the reflection of an infrared light source and the centre of the pupil to 
calculate the direction of the eye gaze 
Source: Drewes and Schmidt (2007) 
There are also gaze estimation methods which use visible light and not infrared light or which do not extract 
features of the eye, such as the dual-Purkinje method (for a more in-depth discussion see Hansen & Ji, 2010). 
Eye-trackers can differ in terms of position of the cameras used, the illumination used as well as the type of 
data they produce and how the data is analysed (Holmqvist et al., in press). In broad terms they can be 
categorised as head-mounted eye-trackers, head-trackers and static eye-trackers (Holmqvist et al., in press). 
The head-mounted eye-tracker is worn on the head of the participant and uses a camera to record the 
stimulus (Holmqvist et al., in press). A head-tracker adds the functionality of tracking the position of the head 
to the head-mounted eye-tracker (Holmqvist et al., in press). Static eye-trackers can be subdivided 
into remote and tower-mounted eye-trackers (Holmqvist et al., in press). Both use illumination, such as infrared 
lighting, and a camera which are positioned in front of the participant. However, the tower-mounted eye-
tracker requires physical contact with the participant and restrains head movement, while the remote eye-
tracker requires no contact with the participant (Holmqvist et al., in press). The remote eye-tracker offers the 
advantage of non-invasiveness but is generally less accurate than tower-mounted and head-mounted eye-
trackers (Li, Winfield & Parkhurst, 2005). 
A remote video-based corneal reflection eye-tracker will be used in this study.  
 
2.8.2 Eye-tracking applications 
Applications using eye-tracking can broadly be classified as either diagnostic or interactive. Within the field of 
HCI, diagnostic applications are used to determine and record the eye gaze of the user for post-trial 
assessment and analysis (Jacob & Karn, 2003). In these instances the system is not required to react to the 
perceived eye gaze (Duchowski, 2002). In contrast, HCI interactive systems use eye gaze as an input modality 
and the system is required to respond to the eye gaze in an appropriate manner (Duchowski, 2002).  
Duchowski (2002) subdivides interactive applications into either selective or gaze-contingent applications. 
Selective applications use eye gaze as an input device, specifically in terms of a pointing device. Gaze-
contingent applications use the eye gaze information to facilitate rapid rendering of a complex display as the 
information in the peripheral vision and extending beyond that is degraded so as not to consume resources 
(Duchowski, 2002). Since information in peripheral vision is not processed it seems reasonable to suppress this 
information to lessen the load on the users’ cognitive perception. In this way, it may be possible to provide 
users with a frame of reference whereby they are certain that the area they are fixating on is correctly tracked. 
However, the size of objects could mean that there may be multiple objects in the gaze-contingent window, 
which fails to give concrete affirmation to the user that the correct button or object will be manipulated, just 
that the general area is correct.  
For the current study eye gaze will be used in an interactive, selective capacity. 
 
2.8.3 Activation mechanisms 
Eye movement-based human-computer interaction can be classified as either requiring natural or unnatural 
movements (Jacob & Karn, 2003). Interfaces using natural eye movements respond to eye movements which 
are natural to the user, such as normal scanning of an interface (Jacob & Karn, 2003). Unnatural eye 
movements are those that must be executed in a specific way in order to elicit reaction from the system (Jacob 
& Karn, 2003), such as gaze gestures. A number of different mechanisms for eye gaze activation will be 
discussed with some examples being given of their use in other applications. 
 
2.8.3.1 Dwell time 
“At first, it is empowering to be able simply to look at what you want and have it happen, rather than 
having to look at it (as you would anyway) and then point and click it with the mouse or otherwise issue a 
command. Before long, though, it becomes like the Midas Touch. Everywhere you look, another 
command is activated; you cannot look anywhere without issuing a command. The challenge in building a 
useful eye tracker interface is to avoid the Midas Touch problem.” (Jacob, 1991, p. 156) 
The most natural method to trigger a response from a system via eye gaze is that of dwell time as it is highly 
intuitive and requires no training (Stampe & Reingold, 1995). Dwell time is the duration of a fixation, or the 
length of time that the user must continuously gaze at an object, in order to trigger a response (Jacob, 1991). A 
drawback of using dwell time is that it can soon escalate into the aptly named Midas touch problem (see quote 
above). This problem can be overcome by lengthening the dwell time required to activate commands or using 
a secondary action such as a mouse click or button press (Ashmore, Duchowski & Shoemaker, 2005). However, 
if the dwell time is too long then the speed advantages of using eye gaze are lost. Additionally, the user may 
become frustrated at having to maintain a stable gaze for a protracted length of time.  
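A minimal sketch of dwell-time activation is given below. The class name `DwellSelector`, the 500 ms default 
threshold and the assumption that some other component reports which object lies under the gaze are all 
hypothetical; the sketch only illustrates the principle that a command fires once the gaze has rested on the 
same object for the required duration. 

```python
class DwellSelector:
    """Activate a target once the gaze has rested on it for `dwell_ms` milliseconds."""

    def __init__(self, dwell_ms=500):
        self.dwell_ms = dwell_ms   # assumed threshold; adjustable per user
        self._target = None        # object currently under the gaze
        self._since = None         # time at which the gaze arrived on it
        self._fired = False        # prevents repeated activation during one fixation

    def update(self, target, timestamp_ms):
        """Feed the object under the gaze (or None); return it if it should be activated."""
        if target != self._target:
            # The gaze moved to a different object: restart the timer.
            self._target = target
            self._since = timestamp_ms
            self._fired = False
            return None
        if target is None or self._fired:
            return None
        if timestamp_ms - self._since >= self.dwell_ms:
            self._fired = True
            return target
        return None
```

A brief glance at a button (shorter than the threshold) therefore triggers nothing, which is how a longer 
dwell time mitigates the Midas touch problem, at the cost of slower selection. 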
The ideal solution would be to react to eye gaze when appropriate but also allow the user to glance at the 
interface without activating commands when they so desire (Jacob, 1991). Subtle feedback can also be given, 
for example highlighting a button when it is about to be activated rather than simply executing the command 
suddenly without first giving the user some form of feedback (Jacob & Karn, 2003).  
The Risø National Laboratory at Roskilde in Denmark developed a system called EyeCatcher which makes use 
of EyeCons. Glenstrup and Engell-Nielsen (1995) report on the use of EyeCons in their thesis as a successful 
solution to the Midas touch problem. These are gaze sensitive buttons that are placed next to a selectable 
area. Users are required to fixate on the EyeCon to activate it and by placing the EyeCon next to the selectable 
area the risk of the user accidentally activating the command is reduced. As an added measure, continuous 
feedback is given to users to enable them to judge when the activation will occur. An animation of an eye 
closing (Figure 2.4) is played on the EyeCon and only once the eye is closed, is the action triggered. For 
inexperienced users, the optimal dwell time for EyeCons was found to be 500 milliseconds while the dwell 
time could be set shorter for more experienced users, but was still dependent on the individual. 
 
Figure 2.4: EyeCon animation of eye closing 
Source: Glenstrup and Engell-Nielsen (1995) 
No reasons were given as to why an animation of an eye blink was used for dwell time. A colour gradient or 
some other indicator without such strong connotations may be more usable. 
Typically, dwell times range from 400-1000 milliseconds (Špakov & Miniotas, 2003) and oftentimes are 
accompanied by an option to change the dwell time to better suit the needs of the user. This requires that 
users set the dwell time according to their perceived needs. However, it is possible to continually evaluate the 
speed with which users type with their eyes and adjust the dwell time accordingly (Špakov & Miniotas, 2003). 
This method proved popular and removes the need for the user to adjust the dwell time until a suitable time 
is found for their individual use. These findings were confirmed during experimentation with dwell time periods 
where it was found that for novice users the dwell time should be longer than 500 ms but that this time should 
be adjustable and adaptive (Hansen, Johansen, Hansen, Itoh & Mashino, 2003). Furthermore, the dwell time of 
a button should be dependent on the button and the amount of time required to perceive the button – for 
example, does it contain images, full-length text or just letters (Hansen et al., 2003). Therefore, the amount of 
information on the button that must be processed affects the interaction technique directly as users will 
require more time to process the information before a decision can be made as to whether the button must be 
selected. Once the users become accustomed to the interface, recognition will play a role in deciding which 
button is required and the dwell time can be significantly reduced. There are also various methods of 
implementing dwell time such as continuous or accumulated dwell time (Hansen et al., 2003).  
The multimodal interface in the current study will provide dwell time capability although it will not be included 
as a factor in the study. Additionally, since the aim of the study is not to determine optimal dwell times for the 
task at hand, users will be permitted to adjust the dwell time to suit their individual needs. This will also 
provide better control for the users who will be able to determine for themselves how much time they need 
before activating the key. Since this activation method will primarily be used for typing and the keyboard 
buttons will contain only a single letter, the amount of time required to process the button text should be 
negligible. Empirical testing of the dwell time will not be done as many such studies have already been 
conducted. 
 
2.8.3.2 Blinking 
Buttons and other widgets can also be selected by responding to blinking. However, blinking is not necessarily 
the ideal solution to the Midas touch problem as blinking is not always voluntary (Ashmore et al., 2005) and 
the rate of blinking is also affected by the user’s workload (Jacob & Karn, 2003).   
As an example, Špakov (2005) showed that chess pieces can be moved by first selecting the piece and the 
destination square using one of three selection modes, namely dwell time, blinking and gaze gestures. Of 
these, dwell time seemed to be the most attractive to the users, while blinking and gaze gestures were 
considered to be tiring. The fact that blinking is considered to be tiring could be due to the natural reaction of 
the users to try and stop blinking as they are nervous that they may inadvertently activate a command. The 
effort required to stop blinking and the resultant eye fatigue caused by not blinking may prohibit blinking from 
becoming an acceptable means of communication. However, as with any other interaction technique, 
practice and extended exposure could eliminate the problem. An additional consideration for using blinking is 
the frequency of the blinking that will be required, since blinking may be useful if used in moderation. 
To provide more options to the users, blinking will be incorporated as an interaction technique in the 
multimodal interface of the current study. However, in an effort to overcome a user’s urge to suppress 
blinking, a pronounced blink will be required in order to activate targets. This means that both eyes must be 
kept closed for a protracted period of time. This should allow distinction to be made between natural blinking 
and a command blink. The time required for the pronounced blink should ideally be much shorter than the 
threshold used for activation through dwell time to facilitate faster selection speeds. Users will be allowed to 
change the duration for which the blink must be executed as their expertise and needs change. The testing of 
the blinking interaction technique is beyond the scope of the study although it may be worthwhile to test the 
usability and user reaction to the pronounced blink. 
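The separation of natural blinks from command blinks by duration can be sketched as follows. The 300 ms 
threshold, the sample format and the function name `command_blinks` are assumptions made for illustration 
only and do not describe the implementation used in the study. 

```python
def command_blinks(samples, min_ms=300):
    """Return the timestamps at which a pronounced (command) blink ended.

    `samples` is a time-ordered list of (timestamp_ms, both_eyes_closed) pairs,
    as a hypothetical eye-tracker might report them. A closure shorter than
    `min_ms` is treated as a natural blink and ignored.
    """
    events = []
    closed_at = None
    for timestamp, closed in samples:
        if closed and closed_at is None:
            closed_at = timestamp              # eyes have just closed
        elif not closed and closed_at is not None:
            if timestamp - closed_at >= min_ms:
                events.append(timestamp)       # long enough to count as a command
            closed_at = None                   # eyes are open again
    return events
```

In this sketch a 120 ms closure is discarded as a natural blink, whereas a 400 ms closure is reported as a 
command. Allowing `min_ms` to be changed corresponds to letting users adjust the required blink duration. 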
 
2.8.3.3 Look-and-shoot 
When using the look-and-shoot method of eye gaze interaction (De Luca, Weiss & Drewes, 2007), the user is 
required to gaze at an interactive object whilst simultaneously pressing a button in order to trigger a system 
response (Ware & Mikaelian, 1987). Using such an activation method has the advantage that the time lost to 
dwell time activation is no longer applicable and the full benefit of high-speed eye movements can potentially 
be exploited. It can also be considered analogous to normal mouse use, where the user has to click a mouse 
button to elicit a response from the system. While pressing a keyboard key requires some form of motor 
control, look-and-shoot can use any form of trigger, such as blowing into a pipe or issuing a speech command.  
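The look-and-shoot principle, in which the gaze supplies the location and a separate trigger supplies the 
activation, can be sketched as follows. The event format, the rectangle-based targets and the function names 
are hypothetical and serve only to illustrate the technique. 

```python
def hit_test(point, targets):
    """Return the name of the target whose rectangle contains `point`, or None."""
    px, py = point
    for name, (x, y, width, height) in targets.items():
        if x <= px < x + width and y <= py < y + height:
            return name
    return None


def look_and_shoot(events, targets):
    """Return the targets activated by a time-ordered stream of events.

    Each event is ("gaze", (x, y)) or ("trigger", source), where `source` may be
    a key press, a speech command or any other switch. A trigger activates
    whatever target lies under the most recent gaze position.
    """
    gaze = None
    activated = []
    for kind, value in events:
        if kind == "gaze":
            gaze = value
        elif kind == "trigger" and gaze is not None:
            hit = hit_test(gaze, targets)
            if hit is not None:
                activated.append(hit)
    return activated
```

Because activation depends only on the trigger, merely looking at a target never issues a command, and no 
dwell period has to elapse before the selection is made. 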
Although look-and-shoot (using the Enter key) will be available for use in the multimodal interface of this 
study, the use thereof will not be included in the user testing. The proposed interaction technique of eye gaze 
and speech is similar to look-and-shoot and will be discussed in more detail in a following section. 
 
2.8.3.4 Gestures 
Analogous to mouse gestures, the use of eye gestures has been suggested as a means of communicating with 
the interface via eye gaze. In terms of applying this concept to eye gaze, gestures require users to perform a 
predefined set of eye movements that can be interpreted by the system as a command being issued. The use 
of gestures is a fairly novel idea in terms of eye gaze but it has already been implemented with some success 
both for numeric (cf. De Luca et al., 2007) and alphabetic input (cf. Huckauf & Urbina, 2008; Porta & Turina, 
2008; Wobbrock, Rubinstein, Sawyer & Duchowski, 2008). Gestures are not ambiguous and there is less risk 
of a command being accidentally executed when using a gesture than when using dwell time or blinking. 
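One simple way in which a predefined set of eye movements could be interpreted as a command is sketched 
below. The reduction of gaze movements to four stroke directions, the 50 pixel minimum stroke length and the 
gesture table are assumptions made for illustration; they do not reproduce the recognisers used in the cited 
studies. 

```python
def strokes(points, min_length=50):
    """Convert successive fixation points into a string of U/D/L/R strokes."""
    result = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if max(abs(dx), abs(dy)) < min_length:
            continue  # movement too small to be a deliberate stroke
        if abs(dx) >= abs(dy):
            stroke = "R" if dx > 0 else "L"
        else:
            stroke = "D" if dy > 0 else "U"  # screen y-coordinates grow downwards
        if not result or result[-1] != stroke:
            result.append(stroke)  # merge consecutive strokes in the same direction
    return "".join(result)


# Hypothetical gesture vocabulary mapping stroke sequences to commands.
GESTURES = {"RDLU": "save", "LR": "close"}


def recognise(points, gestures=GESTURES):
    """Return the command matching the stroke sequence, or None if there is no match."""
    return gestures.get(strokes(points))
```

Because a command is issued only when the complete stroke sequence matches an entry in the table, ordinary 
scanning of the interface is unlikely to trigger a command by accident. 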
Gaze gestures have been used to input numerical PINs and were found to be more robust against erroneous 
input although they did require more time than a standard keypad entry (De Luca et al., 2007). Such an input 
technique could increase the security of PINs as gaze gestures will be difficult to detect by an observer as no 
visual feedback will be given. Forty percent of participants stated that they would be able to use mnemonic 
aids, such as a shape traced on a numeric keypad, to remember their PIN gaze gestures (De Luca et al., 2007). 
Unfortunately, there are far fewer digits than letters in the alphabet and it remains to be seen whether the same 
mnemonic benefit can be gained from alphabetic input. EyeWrite (Figure 2.5) is an example of an application 
which makes use of alphabetic gaze gestures. A gaze sensitive application interprets gaze gestures as 
alphabetical characters and then relays these to any standard Windows text editor (Wobbrock et al., 2008).  
Users of EyeWrite did require some time to become accustomed to the use of the system and while they could 
not attain the same input speeds as when using an onscreen keyboard, there were significantly fewer 
uncorrected errors in the transcribed text (Wobbrock et al., 2008). Users also preferred using EyeWrite to an 
onscreen keyboard as they perceived it to be faster, less error-prone and less fatiguing on the eyes (Wobbrock 
et al., 2008). 
Keyboard shortcuts can significantly reduce the time required to complete a task. Whilst observing a number 
of users during a document handling task, it was noted that the majority of users remove their hands from the 
keyboard to save the document using the toolbar shortcut (Drewes & Schmidt, 2007). This presented an ideal 
situation to remove the added effort of using the mouse to invoke a command by making a gaze gesture 
available. Subjects were immediately able to execute the gaze gesture to save the document but indicated that 
they would prefer to use the keyboard shortcut as an alternative to using the mouse. When closing a dialog 
box using gaze gestures, users were able to perform this task in the same time as it would take when using a 
mouse. Gaze gestures were also found to be reliable regardless of the background, including such complex 
backgrounds as tables, text or web pages. 
 
 
Chapter 2 Theoretical Background 
 
Figure 2.5: EyeWrite being used with Microsoft Notepad 
Source: Wobbrock, Rubinstein, Sawyer and Duchowski (2008) 
 
Gaze gestures have even been implemented on the limited screen space of mobile phones – a novel idea 
which users found attractive (Drewes, De Luca & Schmidt, 2007). A promising finding for the adoption of gaze 
gestures is that users have displayed an uncanny ability to switch between natural eye movements and eye 
movements that they surmised would elicit a response from the system (Hyrskykari, Majaranta & Räihä, 2003). 
When using iDict, a system developed by Hyrskykari et al. (2003), the system automatically determines if the 
reader is having difficulty comprehending a part of the text based on their reading pattern. Users quickly 
realised, without prompting, that by manipulating their eye gaze they could “force” the system to provide 
assistance (Hyrskykari et al., 2003). 
The recent interest in gaze gestures has yielded some very encouraging results both in terms of user 
acceptance and objective usability measures. Since users have been able to adopt the use of gaze gestures 
with relative ease, gestures have become an exciting prospect albeit one which still requires in-depth 
investigation. Gaze gestures were considered as an activation mechanism in this study but due to time 
constraints they were eventually not included in the multimodal interface. 
 
2.8.3.5 Pupil size 
As most eye-tracking devices automatically measure and record pupil size during interaction, it could be 
suggested that this is an alternative means to influence interaction. It has been found that subjects are able to 
voluntarily control their pupil size (Ekman, Poikola, Mäkäräinen, Takala & Hämäläinen, 2008). However, it 
remains to be seen if this is a viable interaction method as the type of task currently being completed might 
also influence the pupil size of users. Pupil size was not included in the study as an interaction technique due 
to the potential inaccuracy of measuring pupil size with an eye-tracker as well as the learning which will be 
required for users to control their pupil size. 
 
  
2.8.4 Using eye gaze in user interfaces  
“… to load the visual perception channel with a motor control task seems fundamentally at odds with 
users’ natural mental model in which the eye searches for and takes in information and the hand produces 
output that manipulates external objects. Other than for disabled users, who have no other alternative, 
using eye gaze for practical pointing does not appear to be very promising” (Zhai, Morimoto & Ihde, 
1999, p. 247) 
“Human beings look with their eyes and typically, when they want to point to something, they look before 
they point (citation). Therefore, using eye gaze as a way of pointing on a computer seems like a natural 
extension of our human abilities” (Kumar, Paepcke & Winograd, 2007, p. 421). 
The most trivial interactive use of an eye-tracker in HCI is to substitute the mouse with the eye-tracker and use 
the eye gaze of the user to determine the movement of the mouse cursor and to execute clicks (Jacob & Karn, 
2003). The above quotes represent the conflicting views of researchers in the field of eye gaze technology. In 
the ensuing discussion it will be seen that eye gaze has been successfully implemented as an input channel in 
diverse environments ranging from gaming and virtual reality to so-called EyePliances and text editors. 
 
2.8.4.1 Replacement of the cursor 
Naturalness is one of the main aims of the next generation interfaces (Jacob, 1993a) and without a doubt it 
can be said that object manipulation using a mouse is far more natural than using eye gaze (Hyrskykari, 1997). 
Moreover, all mouse-controlled functions, namely clicking, double-clicking, right-clicking, dragging and 
releasing, must be able to be executed (Su et al., 2005) – actions for which the eyes have no natural 
mechanism.  
A seemingly inconsequential but important design aspect is whether or not to allow the mouse cursor to be 
slaved to the eye gaze. If eye-trackers were 100% accurate then the mouse cursor would always be stationary 
on the retina, thereby becoming completely imperceptible (Jacob, 1995a; Jacob, 1995b). However, few eye-
trackers, if any, are capable of such precision, rendering this particular facet of the problem unresolved. It 
does, however, present a secondary problem in that the cursor may be slightly offset from the centre of the 
gaze, thereby drawing the attention of the user and causing them to look at the cursor which would in turn 
cause the cursor position to change accordingly creating a scenario where the user effectively chases the 
cursor (Jacob, 1995a; Jacob, 1995b). Furthermore, the movement of the cursor as it follows the eye would be 
prolific and fairly distracting to the user (Jacob, 1995a; Jacob, 1995b).  
The afore-mentioned arguments, together with other reasons (cf. Jacob & Karn, 2003; Jacob, 1995a), leave 
many people wondering why it is necessary to explore the idea of eye gaze controlled interfaces. Firstly, it may 
improve utilisation of the bandwidth from user to computer which, in WIMP interfaces, is under-utilised to a 
large degree (Jacob, 1993a). Secondly, in order to meet the obligation of providing an equivalent experience to 
all users regardless of their abilities, it is necessary to develop interfaces for users with motor disabilities but 
who still retain full control of their oculomotor facilities. Thirdly, it may increase the speed of interaction since 
eye movement is faster (Hyrskykari, 1997) and consequently target acquisition may be faster than with a 
mouse (Oyekoya & Stentiford, 2006). Fourthly, since it is natural to look at a target before attempting to select 
it, the eye has already moved to and acquired the target by the time a selection is made. It could therefore just 
as well be used for input purposes without requiring an additional manual movement, which some feel could 
provide more natural communication with a computer. 
These contradictory views leave room for exploration of the use of eye gaze and how it can best be utilised to 
improve the usability of user interfaces. A number of studies, some of which will be discussed in the following 
section, have already been conducted to provide empirical evidence to confirm or refute the theoretical 
opinions.  
2.8.4.2 Target selection 
The applicability of Fitts’ Law to eye pointing was established by Ware and Mikaelian (1987), disputed by 
Zhai et al. (1999), who found only a low correlation between eye input and Fitts’ Law, and confirmed again 
by Miniotas (2000), who found a high correlation with a variation of Fitts’ Law. Sibert and Jacob (2000) 
established that the further away the target is, the greater the advantage of using eye gaze because the cost 
remains constant irrespective of distance. Regardless of the Fitts’ Law variations, Ashmore et al. (2005) insist 
that the underlying theory behind Fitts’ Law is as applicable to eye pointing as to manual input. Therefore, 
mean selection time can be reduced either through expansion of the targets or by reducing the distance to 
said target (Ashmore et al., 2005).  
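The relationship that Ashmore et al. (2005) rely on can be illustrated with a short calculation. The sketch below assumes the widely used Shannon formulation of Fitts’ Law, MT = a + b·log2(D/W + 1); the coefficients a and b are hypothetical values chosen only for illustration and are not taken from any of the cited studies.

```python
import math

def index_of_difficulty(distance, width):
    """Index of difficulty in bits (Shannon formulation, assumed here)."""
    return math.log2(distance / width + 1)

def predicted_movement_time(distance, width, a, b):
    """Fitts' Law: MT = a + b * ID, with a (s) and b (s/bit) fitted per device."""
    return a + b * index_of_difficulty(distance, width)

# Hypothetical regression coefficients, for illustration only.
A, B = 0.2, 0.15

base = predicted_movement_time(distance=300, width=20, a=A, b=B)
wider = predicted_movement_time(distance=300, width=60, a=A, b=B)   # expanded target
closer = predicted_movement_time(distance=100, width=20, a=A, b=B)  # reduced distance

print(base, wider, closer)
```

Under these assumptions, both tripling the target width and reducing the distance to a third lower the index of difficulty from 4 bits to about 2.6 bits, which is why either change shortens the predicted selection time.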
Manual And Gaze Input Cascaded (MAGIC) pointing involves warping the mouse cursor to the eye gaze (Zhai et 
al., 1999). Consequently, when the user focuses on a target, the mouse cursor is automatically moved to that 
position so that very little manual movement is required. Alternatively, the mouse cursor is positioned on the 
boundary of the eye gaze when the user initiates movement of the mouse. The mouse pointer must then 
manually be moved over the target. The former method, dubbed liberal MAGIC pointing, was faster for target 
selection than traditional manual mouse selection, while the latter, called conservative MAGIC pointing, was 
slower than manual pointing. This finding was supported by another study in which direct manipulation with eye 
gaze proved to be faster than with a mouse, although selection speed slowed over time, possibly as a consequence of 
eye fatigue (Sibert & Jacob, 2000). The fact that conservative MAGIC pointing was slower than manual 
pointing raises the question as to whether the participant looked at the mouse pointer and then back at the 
destination point before adjusting the mouse cursor onto the target. Since eye gaze and mouse movement are 
closely related it would seem highly plausible as it would be natural to glance at the mouse cursor once it 
appears and before mouse interaction commences. The distance threshold used to identify a new object 
should prevent a readjustment to the cursor position in this instance. Another explanation for the difference 
between liberal and conservative MAGIC pointing is the fact that more manual movement could be required 
for the conservative MAGIC pointing as the mouse pointer was warped to the boundary of the gaze area and 
not the calculated gaze position as with liberal MAGIC. The actual gaze position should be near to or on the 
target, thereby significantly reducing the amount of manual movement required. 
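The difference between the two warping strategies can be sketched in a few lines. This is a simplified illustration and not the implementation of Zhai et al. (1999): the function names, the threshold of 120 pixels, the gaze area radius of 60 pixels and the choice of the boundary point nearest to the cursor are all assumptions made for the example.

```python
import math

def liberal_warp(cursor, gaze, new_object_threshold=120.0):
    """Liberal MAGIC: warp the cursor onto the gaze point whenever the gaze
    has moved further than the (assumed) new-object threshold from the cursor."""
    if math.dist(cursor, gaze) > new_object_threshold:
        return gaze
    return cursor

def conservative_warp(cursor, gaze, mouse_actuated, gaze_radius=60.0):
    """Conservative MAGIC: warp only once the mouse is actuated, and only as
    far as the boundary of the (assumed circular) gaze area."""
    distance = math.dist(cursor, gaze)
    if not mouse_actuated or distance <= gaze_radius:
        return cursor
    scale = gaze_radius / distance
    # Point on the gaze-area boundary that lies closest to the old cursor.
    return (gaze[0] + (cursor[0] - gaze[0]) * scale,
            gaze[1] + (cursor[1] - gaze[1]) * scale)

print(liberal_warp((0, 0), (300, 400)))
print(conservative_warp((0, 0), (300, 400), mouse_actuated=True))
```

In the sketch the liberal variant leaves the cursor on the gaze point itself, whereas the conservative variant leaves it one gaze radius away, which is consistent with the explanation that conservative MAGIC pointing requires more manual movement.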
In order to negate the effect that moving the mouse has on the positioning of the mouse cursor, Drewes and 
Schmidt (2009) used MAGIC pointing with a touch sensitive mouse. When the user touches the mouse the 
pointer will move to the current gaze position. Their findings were that participants preferred using gaze to 
position the mouse rather than moving the mouse. This type of interaction could save a substantial number of 
mouse movements without placing additional strain on the eyes since a user will have to look at a target 
before clicking it. The combination of gaze and a touch sensitive mouse offers speeds that are superior to that 
of a mouse (Drewes & Schmidt, 2009). The interface did not provide feedback to the user as to where the gaze 
position was detected. Instead, the positioning of the mouse pointer at the gaze position was regarded as the 
visual feedback and the onus was on the user to verify that the correct target had been acquired before it 
could be clicked. While this could explain the increased accuracy which was achieved in the study since there 
were fewer incorrect clicks, it could also have an impact on the speeds achieved. Nevertheless, this was a 
successful combination of the advantages of the liberal and conservative MAGIC pointing by preventing 
unwanted cursor movement but still positioning the cursor at the eye gaze which reduces the amount of 
manual movement required. 
 
2.8.4.2.1 Using an ISO standard to assess a pointing device 
The International Organization for Standardization (ISO) ratified a standard, ISO 9241-9, for testing the speed and accuracy of 
pointing devices for comparison and testing purposes. The details of this ISO standard will be discussed in 
depth in Chapter 3, but some results will be discussed here. The first study to test eye-tracking as an input 
device using ISO 9241-9 was conducted by Zhang and MacKenzie (2007). This test used the multi-directional 
tapping test across four conditions, namely (a) a dwell time of 750 ms, (b) a dwell time of 500 ms, (c) 
look-and-shoot, which required participants to press the Space bar to activate the target they were looking at, 
and (d) the mouse (Zhang & MacKenzie, 2007). A head-fixed eye-tracking system with an infrared camera and 
a sampling rate of 30 Hz was used for the study. The look-and-shoot method was the best of the three eye-tracking 
techniques with a throughput of 3.78 bps compared to the mouse with 4.68 bps. Throughput (defined 
in Chapter 3) is a measurement which incorporates both speed and accuracy of use and can be used to 
measure the usability of pointing devices. 
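To give an indication of how a single figure in bits per second can combine speed and accuracy, the sketch below computes throughput using the effective target width, a formulation commonly associated with ISO 9241-9 evaluations. It is an assumption that this matches the definition given in Chapter 3, and the sample data are invented for illustration.

```python
import math
import statistics

def throughput(distance, endpoint_deviations, movement_times):
    """Throughput in bits per second, assuming TP = ID_e / MT with
    ID_e = log2(D / W_e + 1) and W_e = 4.133 * SD of the selection endpoints."""
    effective_width = 4.133 * statistics.stdev(endpoint_deviations)
    effective_id = math.log2(distance / effective_width + 1)
    return effective_id / statistics.mean(movement_times)

# Invented sample: endpoint deviations from the target centre (pixels)
# and movement times (seconds) for five selections at a distance of 400 pixels.
deviations = [-5.0, 3.0, 0.0, 4.0, -2.0]
times = [0.62, 0.70, 0.66, 0.71, 0.61]

print(throughput(400, deviations, times))
```

Because the effective width grows with the scatter of the selection endpoints, a device that is fast but imprecise is penalised in the same measure as one that is precise but slow.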
The fact that the look-and-shoot method is the most efficient activation mechanism is not surprising since the 
selection time of a target is not dependent on a long dwell time and theoretically target acquisition times for 
all interaction techniques should be similar. The time required to press the space bar, particularly if users can 
keep their hand on it, should be shorter than the dwell time, which was confirmed by the results of the 
aforementioned study (Zhang & MacKenzie, 2007). Recommendations stemming from the study included that 
a dwell time of 500 ms seemed the most appropriate so as to avoid the Midas touch problem whilst 
simultaneously ensuring that participants did not get impatient waiting for system reaction (Zhang & 
MacKenzie, 2007). Increasing the width of the target reduced the number of errors made but had no effect on 
the throughput. Participants indicated that the high speed positioning of the eye-tracker is desirable but that it 
causes eye fatigue, dry eyes and discomfort. Since the eye-tracker used was head-fixed, neck and shoulder 
fatigue was also a source of concern for respondents. 
In a comparable study, the ISO standard was used to compare four pointing devices which could serve as a 
substitute mouse for disabled users (Man & Wong, 2007). The four devices tested were the (i) CameraMouse, 
which was activated by body movements captured via a USB web cam, (ii) a Head-Array Mouse Emulator, (iii) a 
CrossScanner, which has a mouse-like pointer activated by a single click and an infrared switch and (iv) a Quick 
Glance Eye Gaze Tracker which allows cursor placement through use of eye movement (Man & Wong, 2007). 
Targets had a diameter of 20 pixels and the distance between the home and the target was 40 pixels. Two 
disabled participants, both with dyskinetic athetosis and quadriplegia, were tested over a period of eight 
sessions with two sessions per week. Each participant was analysed separately and it was found that the 
CrossScanner was suitable for both participants although the ASL Head-Array was also suitable for use by one 
of the participants. This study is a prime example of the difficulties associated with testing disabled users. The 
disabilities are often specific to the user and wide-ranging customisation will have to be provided to ensure 
that one interface can cater for a diverse group of disabled users. The findings cannot be generalised to any 
population and also cannot serve to confirm or refute many other studies. The current study aims to provide a 
highly customisable multimodal interface which will allow a number of different interaction techniques to be 
used according to the preference and capabilities of users. However, due to the intricacies involved, disabled 
users will not be tested at this stage but deferred until after the usability of the multimodal interface has been 
established for able bodied participants. 
Since the use of the ISO standard for testing eye gaze has been established it will be used to compare the 
various selection techniques which will be discussed in the next section.  
 
2.8.4.2.2 Increasing accuracy 
An additional problem of gaze based interfaces is the size of targets that are required (Drewes & Schmidt, 
2009). For example, suppose the resolution of the screen is 1024×768 pixels and that the user sits, on average, 
60 centimetres from the screen, then the minimum size of the targets will have to be approximately 31 pixels 
(about one degree of visual angle), which is larger than the standard widgets in current windowed environments, 
which are generally 24×24 pixels in size. Therefore, most standard GUI elements are less than one degree of 
visual angle in size. The ribbon 
concept which has recently been adopted by Microsoft to replace standard menus may offer some hope as the 
majority of the icons are now much larger than in previous toolbars. However, they are closely spaced and 
may still present a challenge to select accurately with eye gaze. Additionally, the most common tasks in 
Microsoft Word such as justification and formatting still use the smaller icons. However, the common tasks all 
have keyboard shortcuts and can also easily be accommodated in a speech grammar, which was done in this 
study. 
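The figure of approximately 31 pixels quoted at the start of this section can be reproduced from the geometry of visual angle. The physical size of the screen is not stated in the text, so the sketch below assumes a 17-inch display with a 4:3 aspect ratio; with a different screen size the result would differ.

```python
import math

def pixels_for_angle(view_distance_cm, screen_width_cm, horizontal_pixels,
                     angle_deg=1.0):
    """Number of pixels subtended by a visual angle at a given viewing distance."""
    size_cm = 2 * view_distance_cm * math.tan(math.radians(angle_deg) / 2)
    return size_cm * horizontal_pixels / screen_width_cm

# Assumption: 17-inch diagonal, 4:3 aspect ratio, so the width is 0.8 of the diagonal.
SCREEN_WIDTH_CM = 17 * 2.54 * 0.8

one_degree = pixels_for_angle(60, SCREEN_WIDTH_CM, 1024)
print(round(one_degree))            # pixels per degree of visual angle
print(24 / one_degree)              # visual angle of a 24-pixel widget, in degrees
```

Under this assumption one degree of visual angle corresponds to roughly 31 pixels, so a 24×24 pixel widget subtends less than one degree.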
The most natural response to this problem would be to increase the size of onscreen targets. This, however, 
creates the problem that much more screen real estate is used for onscreen widgets – leaving less room for 
the working area of the user. Interfaces designed in this way are often viewed as unnatural and are rarely used by 
users other than the disabled (Špakov & Miniotas, 2005). Both to counteract the impact on available screen 
real estate and to exploit the properties of Fitts’ Law, several target expansion mechanisms have been 
proposed and implemented for both eye pointing and manual input (Ashmore et al., 2005). These include 
expansion of the target, expanding or zooming into the entire display uniformly or expanding a portion of the 
display through the use of a fisheye lens (Ashmore et al., 2005). The following sections will discuss these 
suggestions, a number of which will be tested in a comparative investigation during the current study. A 
comparison will also be drawn with the mouse as a pointing device. Additionally, some of these proposals will 
be included in the multimodal interface for the word processor. 
 
2.8.4.2.2.1 Expansion and magnification of targets 
Expansion of the targets can be either visible or invisible, the latter implying that the user is not aware of the 
expansion. The idea behind invisible expansion is to create a larger selection area around the target without visual 
feedback. This allows room for error and slight displacement of the eye during target selection. As a concrete 
example (Figure 2.6), Miniotas, Špakov and MacKenzie (2004) investigated target selection through invisible 
target expansion. In the experiment, the target was displayed to the user as a 20×20 pixel square icon (dark 
solid square) when in reality the selection area was a 120×120 pixel area (demarcated with a dotted line). 
 
[Figure: a 20×20 pixel target surrounded by a 120×120 pixel selection area]

Figure 2.6: Invisible expansion of targets 
 
Target selection was found to be both faster and more accurate with invisible target expansion. Visual 
feedback was provided by highlighting the target when selection was acquired. As the spatial cost of such a 
design is permanent (Miniotas et al., 2004), it has limited applicability in standard interface elements where 
the expanded target area will overlap adjacent interactive targets, such as on a toolbar. Additionally, using 
invisible target expansion does not reduce the amount of screen real estate that is required. It may, however, 
have the advantage that users will have more confidence as the visually small targets will be easier to acquire 
and select. 
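Invisible expansion amounts to hit-testing the gaze position against an area that is larger than the icon that is drawn. The sketch below uses the 20×20 pixel visible target and 120×120 pixel selection area described above; the function name and the sample coordinates are hypothetical.

```python
def gaze_hits_target(gaze, centre, selection_size=120):
    """Return True if the gaze point falls inside the square selection area
    centred on the target, regardless of how large the target is drawn."""
    half = selection_size / 2
    return (abs(gaze[0] - centre[0]) <= half and
            abs(gaze[1] - centre[1]) <= half)

icon_centre = (500, 300)
gaze_point = (545, 310)   # 45 pixels to the right of the icon centre

print(gaze_hits_target(gaze_point, icon_centre, selection_size=20))   # visible size only
print(gaze_hits_target(gaze_point, icon_centre, selection_size=120))  # invisibly expanded
```

The same test also shows the limitation noted above: if two neighbouring icons are less than 120 pixels apart, their selection areas overlap and a gaze point can fall inside both.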
A gravitational well is similar to invisible expansion of the target as the cursor is automatically pulled towards 
the target once the cursor is close enough to the target (Keates et al., 2002). Testing of a gravitational well 
indicates that users noticed the presence thereof and started to rely on this feature, thus freeing themselves 
to concentrate more on moving the cursor to the desired target (Keates et al., 2002). Additionally, the 
gravitational well substantially improved the selecting prowess of the users (Keates et al., 2002).  
Invisible expansion of the target, specifically in the form of a gravitational well, will be used in this study to 
facilitate a larger selection area of the target without having to visibly increase the size of the targets. This 
method was chosen since it may serve the dual purpose of making target selection easier and simultaneously 
serve to boost the confidence and satisfaction of the user who will be able to select smaller targets with 
relative ease. Moreover, interface elements will not be oversized, adding to the aesthetic appeal of the 
interface.  
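A gravitational well can be approximated by moving the pointer onto the nearest target once it comes within a given radius. The sketch below is a minimal illustration only: it snaps the pointer directly to the target centre rather than pulling it gradually, and the radius of 40 pixels and all names are assumptions rather than details of the implementation used in this study.

```python
import math

def apply_gravity_well(pointer, target_centres, well_radius=40.0):
    """Snap the pointer to the nearest target centre if it lies within the
    well radius; otherwise leave the pointer where it is."""
    nearest = min(target_centres,
                  key=lambda centre: math.dist(pointer, centre),
                  default=None)
    if nearest is not None and math.dist(pointer, nearest) <= well_radius:
        return nearest
    return pointer

icons = [(100, 50), (200, 50), (300, 50)]
print(apply_gravity_well((212, 58), icons))   # close to the second icon
print(apply_gravity_well((150, 200), icons))  # far from every icon
```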
An alternative to static expansion is dynamic expansion of the targets whereby the target is physically 
expanded to a more click-friendly size when users indicate their intention to interact with the target (Miniotas 
& Špakov, 2004). Ashmore et al. (2005) compared an omnipresent fisheye lens, a fisheye lens which appears 
only after a fixation is detected (a MAGIC fisheye lens), a Grab-and-Hold Algorithm (GHA) fisheye lens, which works much 
the same as the MAGIC lens but is fixed in place after fixation onset, and no lens. A fisheye lens shows the area 
in proximity to the gaze in great detail, usually through magnification, while areas in the periphery are 
degraded systematically (Furnas, 1986). The GHA lens counteracts the effect of jitter since it remains fixed 
once it is activated and allows the user to look around the magnified area without the lens moving. Using dwell 
time it was found that the GHA and MAGIC lenses led to significantly faster selection times than when 
selection is completed with no lens or with the omnipresent lens (Ashmore et al., 2005). When measuring 
accuracy of selection it was found that the MAGIC and omnipresent lens improved selection accuracy. 
Additionally, the use of a grab-and-hold algorithm when used in conjunction with invisible expansion resulted 
in a significant decrease in errors (Miniotas et al., 2004). Unfortunately, no comparison was made to a manual 
selection technique so while the MAGIC and GHA lenses were superior to the other two, their performance 
relative to manual selection techniques cannot be verified. The current study will provide a more comprehensive 
comparison of selection techniques by requiring selection with manual and the proposed multimodal 
interaction technique. 
Miniotas and Špakov (2004) express scepticism as to the viability of dynamic expansion due to a number of 
considerations. Firstly, eye movement alternates between fixations and saccades, the latter of which are 
mostly ballistic in nature and during which all visual perception is suppressed (Miniotas & Špakov, 2004). 
Therefore, it would be wasteful to expand targets during saccadic motion. Secondly, the presence of fixational 
eye movements means that the eyes are never really completely at rest which Miniotas and Špakov (2004) 
take to mean that a visually expanding target would distract a user. Lastly, the eye gaze serves to replace the 
cursor which means that cursor movement cannot be tracked in order to determine the area that must be 
expanded (Miniotas & Špakov, 2004).  
The above-mentioned reasoning of Miniotas and Špakov (2004) arguably has merit but not enough to 
summarily dismiss the notion of dynamic expansion; rather, their concerns should be used to adapt dynamic expansion 
accordingly. For instance, replacement of the cursor with the eye gaze simply means that the area situated 
around the current eye gaze position will be expanded. There is also no need for expansion during saccadic 
motion; therefore the detection of a fixation could trigger the target expansion. The drawback of such a 
method is that the expansion of an area may cause its position to move relative to its actual position on the 
screen. This could be disruptive to users as they would then have to readjust their gaze to suit the positioning 
of the expanded area. Therefore, careful consideration must be given to whether expansion should only be 
triggered after a fixation is detected. Furthermore, smoothing and stabilisation algorithms can be used to 
avoid jumpy visual feedback. All of these factors can be negated by expanding not only a single target at a 
time, but rather a realistic area, for example a 200×200 pixel (≈6.5°) area as suggested, around the current 
fixation point, after which the user will be required to confirm the selection by fixating on one of the expanded 
targets. This may require more effort as a result of the repositioning of the eye gaze to acquire the expanded 
area.  
Taking the cautionary advice of Miniotas and Špakov (2004) as well as the above-mentioned proposed 
solutions into account, dynamic expansion of the interface was built into the multimodal interface of the 
current study. A third-party application, capable of magnifying the area under the mouse cursor, was used for 
this purpose. The magnification tool can be switched on or off as the needs of the user change, but when 
switched on it will be omnipresent until it is switched off. The magnification will be slaved to the eye gaze of 
the user, therefore wherever the eye gaze is currently detected, the area directly below that will be magnified 
whether the gaze is stationary or not. This will circumvent the need for readjustments to the eye gaze on the 
magnified area as the magnification will always be slaved to the eye movement and the user will not be aware 
of the spatial shift in relation to the non-magnified area. A smoothing and stabilisation algorithm (section 
3.3.5) will be used to avoid jerky movement of the magnified area. The tool used for magnification provides an 
adjustable zooming factor and display window which provides further customisation for the user who can then 
adjust the zooming factor as required. The default zoom factor is set to enlarge the area to double its actual 
size within a 400×300 window. In order to preserve the naturalness of the interface and the ease with which a 
user can read the contents of a document in a word processor, a fisheye lens which degrades the information 
in the periphery, was not considered for inclusion in the interface. 
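One simple way in which jerky movement of a gaze-slaved magnifier could be avoided is to average recent gaze samples and to ignore small shifts. The sketch below illustrates that general idea only; it is not the algorithm of section 3.3.5, and the window length, the dead zone and the class name are assumptions.

```python
from collections import deque

class GazeSmoother:
    """Moving-average filter with a dead zone for a stream of gaze samples."""

    def __init__(self, window=5, dead_zone=10.0):
        self.samples = deque(maxlen=window)  # most recent raw gaze samples
        self.dead_zone = dead_zone           # minimum shift (pixels) before moving
        self.position = None                 # last reported, stabilised position

    def update(self, x, y):
        self.samples.append((x, y))
        n = len(self.samples)
        mean_x = sum(p[0] for p in self.samples) / n
        mean_y = sum(p[1] for p in self.samples) / n
        moved = (self.position is None or
                 max(abs(mean_x - self.position[0]),
                     abs(mean_y - self.position[1])) > self.dead_zone)
        if moved:
            self.position = (mean_x, mean_y)
        return self.position

smoother = GazeSmoother(window=3, dead_zone=5.0)
for sample in [(100, 100), (103, 98), (160, 100)]:
    print(smoother.update(*sample))
```

With these settings the second sample, which differs from the first by only a few pixels, leaves the reported position unchanged, while the large shift in the third sample moves it.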
 
2.8.4.2.2.2 Zooming the entire display 
Another means of increasing accuracy of eye gaze pointing is to magnify the entire display in response to the 
position of the eye gaze. A drawback of this approach is that contextual information beyond the zoom region is 
lost (Ashmore et al., 2005). Positive results have, however, been achieved for target acquisition in a word 
processing application as well as a web browser using eye gaze and entire display zooming (Bates & Istance, 
2002). Zooming of the entire display will not be used in this study. 
 
2.8.4.2.2.3 Applicability to the current study 
The method of text entry which will be proposed for this study entails that eye gaze be used in the capacity of 
a pointing device. Furthermore, one of the integral interaction means within a word processor is the selection 
of a target, such as an icon or a menu item. Therefore, it is imperative that the proposed interaction technique 
of eye gaze and speech be investigated for its usability as a pointing device and as a technique to select 
targets. 
A number of methods with which the accuracy of pointing and selecting could be increased were discussed in 
the previous sections. To facilitate accurate pointing and clicking, the nature of eye gaze requires that screen 
widgets often have to be larger than the standard, but the resultant loss of screen real estate and oversized 
targets could make the interface undesirable for users. Therefore, other techniques such as zooming of targets 
and magnification are suggested. 
For the purposes of the current study, a gravitational well will be employed in order to increase the selectable 
size of the targets without causing a visible increase in target size. While this should increase the accuracy of 
eye gaze as a pointing device, it will also boost the confidence of the users as they will feel that they are 
mastering eye pointing without added enhancements to the interface. Omnipresent magnification, with a 
smoothing and stabilisation algorithm, will be used to overcome the limitations and problems associated with 
zooming of the interface. 
As far as could be ascertained, there is no study where a direct comparison between invisible and visible 
expansion of the target and magnification was included. Therefore, the current study will address this 
shortcoming by comparing the differences in usability with and without target expansion and magnification as 
well as the differences between these methods. In order to overcome identified shortcomings in previous 
studies, these methods will also be compared to the mouse as a pointing device. In order to control for 
external variables as much as possible, this testing will not be conducted in the word processor but will rather 
make use of the ISO tests as previously discussed. The complete experimental design will be discussed in 
section 3.4.2.2. 
 
2.8.5 Gaze-based user interfaces in practice 
As previously discussed, gaze can be used in user interfaces in two ways, namely for selective or gaze-
controlled interfaces and for gaze-contingent interfaces. A gaze-controlled system typically facilitates the use 
of eye gaze as a selection technique, while a gaze-contingent interface responds to the eye gaze by providing 
detailed information at the point of the gaze but degrading the information at the periphery of the gaze in 
some way (Duchowski, 2002; Rayner, 1998). The following sections will provide an overview of some of the 
gaze-based applications which are available, both in general applications and more specifically for text entry. 
 
2.8.5.1 Eye typing 
Eye typing is a means whereby text is output using eye gaze as an input (Majaranta, MacKenzie, Aula & Räihä, 
2006). According to Majaranta and Räihä (2007), text entry methods can be classified into distinct groups based 
on the input technique which is used. The first category uses direct gaze pointing, where the user has to 
“press” keys on an onscreen keyboard. The second category utilises eye switches, the third requires discrete, 
consecutive gaze gestures and the final group uses continuous gaze gestures. 
An example of an eye typing application which requires direct gaze pointing is GazeTalk, which also employs a 
smaller keyboard which predicts the six most likely letters to be needed based on the previously typed letters 
(Hansen, Hansen & Johansen, 2001). GazeTalk uses dwell time as a selection method. A novel direct gaze 
method, called context switching, was developed to overcome the Midas Touch problem associated with eye 
typing (Morimoto & Amir, 2010). Context switching makes use of two keyboards and requires two eye 
movements in order to achieve key-focus and key-selection. Key-focus is achieved through the use of a short 
dwell time and then in order to type the focused key, that is to activate key-selection, the context must be 
switched. This implies that the user must perform a saccade into the other keyboard or context. Context 
switching resulted in faster typing speeds than simple dwell time with an average typing speed of 12 words per 
minute for participants after eight sessions with the context switching system (Morimoto & Amir, 2010). A 
major drawback of this type of typing system is, of course, the space that is used in order to display two 
keyboards in a single work area. 
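The two-step logic of context switching, key-focus by a short dwell followed by key-selection through a saccade into the other keyboard, can be sketched as a small state machine. The sketch is an interpretation of the description above and not the implementation of Morimoto and Amir (2010); the event format, the focus dwell of 150 ms and the context names are hypothetical.

```python
def context_switch_typing(fixations, focus_dwell_ms=150):
    """Turn a sequence of fixations into typed text.

    Each fixation is a tuple (context, key, dwell_ms), where context names
    the keyboard in which the fixation landed.
    """
    typed = []
    focused = None  # (context, key) currently holding key-focus, if any
    for context, key, dwell_ms in fixations:
        # Key-selection: the gaze has crossed into the other keyboard.
        if focused is not None and context != focused[0]:
            typed.append(focused[1])
            focused = None
        # Key-focus: a short dwell is enough to focus the key being looked at.
        if dwell_ms >= focus_dwell_ms:
            focused = (context, key)
    return "".join(typed)

fixations = [("top", "h", 200), ("bottom", "i", 200), ("top", "x", 50)]
print(context_switch_typing(fixations))
```

In this sketch, looking from key to key within one keyboard only changes the focus; nothing is typed until the gaze crosses into the other context, which is how the approach sidesteps the Midas Touch problem.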
Blinking, winks or coarse eye movements can be used as eye switches in eye typing systems. Examples of such 
systems are the “eye-switch controlled communication aids” (Ten Kate et al., 1979) and I4Control (Fejtová et 
al., 2004). EyeWrite (Wobbrock et al., 2008), as discussed in section 2.8.3.4, is an example of a gesture-based 
eye typing system which uses discrete gestures.  
The gaze-enabled Quikwriting application uses continuous gaze gestures. All characters are placed within an 
inner resting area but grouped in such a way that the location indicates the gesture required (Bee & André, 
2008). The cursor must be moved to successive sections in order to indicate which character is to be typed. 
When the user returns their gaze to the centre resting area, it signifies the end of the gesture and the 
associated character is typed. 
The problems associated with eye typing are threefold. Firstly, eye typing using an eye-tracker generally makes 
use of a full-size onscreen keyboard which covers a large part of the screen. The impracticality of this will 
perhaps ensure that eye typing will never become a mainstream activity for computer users (Isokoski, 2000). 
Secondly, eye typing is fundamentally incapable of achieving the same speeds as keyboard typing which can 
process parallel key presses at a high speed, such as in the case of a touch typist (Stampe & Reingold, 1995; 
Isokoski, 2000). For example, if dwell time is used, the speed of typing is restricted by the dwell time 
(Majaranta et al., 2006), which will probably never match the speeds that can be attained using a keyboard. 
Thirdly, eye typing compels the user to look at the keyboard, thus preventing them from looking at the text 
and typing simultaneously which can easily be achieved by most typists using a keyboard input device 
(Isokoski, 2000).  
Solutions to the first of the afore-mentioned problems could be to provide keyboards with varying key sizes, 
where the commonly used keys are larger than the lesser used keys (Istance et al., 1996). Alternatively, the 
keys can expand dynamically as they receive focus (Istance et al., 1996). Another consideration to minimise the 
amount of screen real estate used by the visual keyboard is to use a cluster keyboard, similar to cell phone 
keypads which use predictive algorithms such as T9. These cluster keyboards partition the letters of the 
alphabet onto several keys, where each key contains more than a single letter. The cluster keyboard can be 
used by selecting the key which contains the desired letter only once, regardless of its position on the key 
(Klarlund & Riley, 2003). When the spacebar is pressed to indicate the end of the word has been reached, the 
sequence of the keys is translated into a list of words which have that sequence of digits (Klarlund & Riley, 
2003). If there is more than one word with that sequence then the most probable word is selected (Klarlund & 
Riley, 2003). This suggested solution could serve to simultaneously reduce the amount of screen real estate 
required as well as speed up text entry since fewer keys have to be inspected and the desired key only has to 
be selected once. 
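The disambiguation step described above can be sketched as a dictionary lookup keyed on the sequence of selected keys. The key layout, the word list and the frequency counts below are hypothetical and serve only to illustrate the idea; they are not taken from Klarlund and Riley (2003).

```python
from collections import defaultdict

# Hypothetical partition of the alphabet onto keys (phone-keypad style).
CLUSTERS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in CLUSTERS.items() for ch in letters}


def key_sequence(word):
    """Return the sequence of cluster keys that spells `word`."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def build_index(word_counts):
    """Map each key sequence to its candidate words, most frequent first."""
    index = defaultdict(list)
    for word, count in word_counts.items():
        index[key_sequence(word)].append((count, word))
    return {
        seq: [w for _, w in sorted(cands, key=lambda c: (-c[0], c[1]))]
        for seq, cands in index.items()
    }


def disambiguate(index, keys):
    """Return the most probable word for a key sequence, or None if unknown."""
    candidates = index.get(keys, [])
    return candidates[0] if candidates else None


# Illustrative frequencies only; not taken from any corpus.
WORD_COUNTS = {"home": 50, "good": 40, "gone": 20, "hood": 5, "eye": 30}
INDEX = build_index(WORD_COUNTS)

print(disambiguate(INDEX, "4663"))  # four words share this sequence
```

In this sketch the remaining candidates for a sequence stay available in the index, so an interface could offer them as alternatives when the most probable word is not the intended one.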
Off-screen targets can also be used to save screen space by spreading the letters around the sides of the 
screen instead of as a bulk keyboard (Isokoski, 2000). Once the user glances off the screen, the eye-tracker is 
still able to track it to a certain degree albeit with a much degraded signal (Isokoski, 2000). The screen can also 
be divided into zones and gestures can be interpreted as letters, as in the case of QuikWriting (Isokoski, 2000).  
The implementation methods suggested for off-screen targets require a fair amount of learning on the part of 
the user before they will be able to use the system; therefore they do not appear to be an ideal solution. 
Another way to save screen space is to make use of a scrollable keyboard where only a portion of the keyboard 
is visible at any given moment and the user must scroll the keyboard to access other letters (Špakov & 
Majaranta, 2008). While screen real estate is gained, the more users have to scroll, the slower their 
resultant typing speed becomes. Špakov and Majaranta (2008) found that average typing speeds ranged from 
15.06 wpm for a full keyboard, through 11.12 wpm for a 2-row scrollable keyboard, to 7.29 wpm for a 1-row 
scrollable keyboard. These speeds were achieved after 8 sessions using the scrollable keyboards. However, 
participants were all experienced with eye typing before the commencement of the study. The amount of 
experience and exposure was, however, not specified and therefore it is difficult to gauge how comparable 
these results will be with the current study. Another shortcoming of the Špakov and Majaranta (2008) study is 
that no indication is given of the learning curve which will be experienced by the general users who have not 
had prior experience with eye typing. 
In order to speed up the typing process, it is possible to implement word completion algorithms which will 
eventually reduce the total amount of typing time (Isokoski, 2000). Early forays into text entry mechanisms for 
the disabled arranged letters in order of their frequency of use (Istance et al., 1996), which would require 
users to adapt to a constantly changing keyboard layout that could confuse them. Instead of 
enlarging common keys as suggested previously, the keys used most regularly by the current user could be 
enlarged. This would be especially helpful for different languages as the frequency of characters differs 
between languages.  Another mechanism which could minimise eye fatigue is to make the layout dynamic and 
rearrange the letters according to their probability of next use. Once again, this could hamper the user more as 
the layout of the keyboard would constantly change and the user may have to search for the required letter. It 
could be expected though that the user would very quickly become familiarised with this practice and at least 
know where to locate a list of probable characters. 
When using dwell time, the dwell time can be automatically adjusted according to the capabilities of the user 
(Majaranta, Ahola & Špakov, 2009). Using an adjustable dwell time, Majaranta, Ahola and Špakov (2009) found 
that typing speeds improved from 6.9 wpm on average to 19.9 wpm while at the same time the dwell time 
decreased from 876 milliseconds to 282 milliseconds. There was also a marked improvement in the error rate. 
These findings bode well for the learnability of eye typing using dwell time since the usability measurements 
all improved over time. 
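The relationship between dwell time and attainable typing speed can be made concrete with a simple calculation. The sketch below assumes the common text entry convention that one word equals five characters and assumes that every character costs exactly one full dwell, ignoring the time needed to find and move to the next key; both assumptions are introduced here for illustration.

```python
CHARS_PER_WORD = 5  # conventional definition of a "word" in text entry research


def max_wpm(dwell_ms, chars_per_word=CHARS_PER_WORD):
    """Upper bound on typing speed if every character costs one full dwell."""
    chars_per_minute = 60_000 / dwell_ms
    return chars_per_minute / chars_per_word


# Dwell times and typing speeds as reported by Majaranta, Ahola and Špakov (2009).
for dwell_ms, reported_wpm in [(876, 6.9), (282, 19.9)]:
    print(f"{dwell_ms} ms dwell: bound {max_wpm(dwell_ms):.1f} wpm, "
          f"reported {reported_wpm} wpm")
```

Under these assumptions the bound rises from roughly 13.7 wpm at 876 milliseconds to roughly 42.6 wpm at 282 milliseconds. The reported speeds fall below both bounds, as would be expected since searching for keys and moving between them also take time.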
Based on the motivation that the majority of quadriplegics and paraplegics sustain their injuries in their late 
teens or early twenties and thirties, Miniotas, Špakov and Evreinov (2003) developed a system of eye typing 
that is founded on the Latin cursive style of writing since by then users would have learnt to write in cursive. 
The system, called Symbol Creator, uses the fact that a set of basic elements can be identified into which most 
of the characters can be decomposed. These basic elements were then used to create a limited set of 
segments which can be combined into cursive letters. An eighth symbol was added to signal the end of a 
character. Users must piece together the segmented symbols in order to type a cursive letter. The system 
provides hints to the user as to which key is required to complete a character and disables those keys that 
cannot be used in conjunction with the previously selected key. The system also provides for the all-important 
feature of feedback by highlighting keys when they are activated by the eye gaze and then highlighting them in 
a different manner when the dwell time expires to indicate to the user that the key has been selected.  
A second text entry method was also incorporated into Symbol Creator which used the idea of the cluster 
keyboards found on cell phones (Miniotas et al., 2003). Results indicated that the cluster keyboard and the 
Symbol Creator system have mean entry rates of 9 words per minute with a very low error rate. Overall the users 
preferred using Symbol Creator to the cluster keyboard (Miniotas et al., 2003). Additionally, very little screen 
real estate was used and only horizontal saccades were required as all the keys were situated in a single line. 
The idea behind the horizontal saccades is an interesting one as it could substantially lessen the fatigue 
experienced by users required to manipulate the mouse pointer with their eyes. The scrollable keyboard will, 
to some extent, mimic the need for horizontal saccades as in some instances there is only one row of 
characters on the keyboard at any time. The speeds achieved with Symbol Creator are between the speeds of 
the 2-row and the 1-row scrollable keyboard. 
Perhaps it will also be possible to reduce eye fatigue by placing the enabled symbols which can be used 
together at a specific position on the keyboard, as previously suggested for onscreen keyboards. The user will 
soon learn where the cluster of allowable keys is situated and it is only when a new character is started that a 
larger search will be needed to locate the start symbol. Unfortunately, the researchers did not report on the 
mental effort or memorisation that was required for Symbol Creator as this could have an impact on the 
speeds achieved. Even though older users may have knowledge of cursive writing, the segmentation process is 
not a natural means of writing and may require some cognitive processing. In this regard, the disabling of 
invalid symbols may provide invaluable assistance to the user.  
To overcome the problem that users have to look at the keyboard, confirmation feedback can be given to the 
user so that they are aware when the character has been typed. This will reduce their need to look at the 
document.  Eye typing speeds have been shown to be higher when both visual and auditory feedback is given 
with users achieving speeds of 7.55 words per minute after four sessions (Majaranta, 2009). Therefore, both 
these feedback mechanisms will be used in the application for the current study.  
Visual feedback of eye gaze position is also an important aspect of eye typing, both to indicate selection and 
focus as it impacts on accuracy, typing speed, gaze behaviour and subjective satisfaction (Majaranta et al., 
2006). Feedback is essential when using any type of gaze controlled interface and not only when it is used for 
typing (Drewes & Schmidt, 2009). However, as previously discussed, using a gaze pointer is challenging since 
any inaccuracy may cause the user to chase the pointer and the pointer could also have jerky movement.  
Eye typing is of paramount importance to the study at hand since typing with the multimodal interface is one 
of the key aspects of the study. Some of the problems which were identified in the discussed studies will be 
circumvented in this study through various means. The issue of wasted screen real estate remains a problem 
but the onscreen keyboard provided in this study will have adjustable keys, meaning that users can increase or 
decrease the size of the keys as it suits their needs. The size of the keyboard will fluctuate with the size of the 
keys thereby providing the opportunity to reduce the amount of screen space that the keyboard occupies. 
Whether screen space is “wasted” on an onscreen keyboard may simply be a subjective issue and not really a 
usability issue. Therefore, questionnaires can elicit user reaction to the onscreen keyboard after the initial use 
thereof and once users have become accustomed to the new layout. Users will also be allowed to toggle 
between different keyboard layouts which will increase the customisability of the keyboard in an effort to 
increase the acceptance of the proposed application. Furthermore, since users will be required to look at the 
keyboard in order to type, audio feedback will be given when a letter is typed – thus eliminating the need to 
continually look at the document for confirmation of typed letters. So as not to distract the user, visual 
feedback of the current selection will only be given when gaze is within the bounds of an onscreen button and 
will remain stable for the entire time that the gaze is detected on the button. While dwell time, blinking and 
look-and-shoot will be provided for selection of onscreen keys, it is the combination of eye gaze and speech 
which is of interest in this study. Using verbal commands is analogous to look-and-shoot which has proven to 
be faster than dwell time selection. Since the target acquisition times should theoretically be the same, 
regardless of the activation mechanism, using speech commands could prove to be faster than dwell time. 
Therefore, by using speech, the increased selection time caused by the dwell time is avoided, the unnatural 
feeling of blinking to select a key is also avoided and, at the same time, the user no longer needs to be able to 
locate and press a keyboard key. The use of speech will also mean that the need for 
double-clicking and right-clicking falls away. Commands which are reached through double- or right-clicking 
can be provided for in the speech grammar. Whether this proposed method of interaction will be able to 
achieve the typing speeds possible with a keyboard will be determined through comparative user testing. 
 
2.8.5.2 Other applications of gaze-interaction 
The Gaze Enhanced UI Design (GUIDe) project in the HCI Group at Stanford University successfully designed a 
number of applications which did not overload the visual channel and exploited the natural use of eye gaze to 
facilitate everyday computing (Kumar & Winograd, 2007). These applications were EyePoint, which facilitated 
pointing and selecting, EyeExposé which used eye gaze to switch applications and EyeScroll which scrolls 
screen content based on the reading speed of the user (Kumar & Winograd, 2007).  
EyePoint (Kumar, Paepcke & Winograd, 2007) provided for a left and right mouse click, a double click, 
mouse-over action and click-and-drag action (requires a start and end drag action) by assigning a keyboard key to each 
one of these. Users were expected to gaze at the link or button and simultaneously press the desired key to 
execute a mouse click, right-click or double-click. This then causes the immediate area around the eye gaze to 
be magnified at which time the user can focus on the magnified target and release the key to execute the 
command. Task analysis showed that EyePoint performs similarly to a mouse and can be faster than the mouse 
in some instances. However, users did report having to concentrate more when using eye gaze but since the 
tasks were strenuous and had to be completed over a short period, it is not expected to have the same effect 
in normal use. Participants did express a high preference for EyePoint. 
The fact that EyePoint could achieve speeds faster than the mouse is undeniably very promising for the use of 
eye gaze as a pointing device. The additional fact that all mouse clicks could be emulated is also very positive. 
However, the design of EyePoint necessitates that in order to use it to its full potential the user must be able 
to press (and hold) six different keys spread out on the numeric keypad. This assumes that the user has control 
over at least one of their limbs to such an extent that they can accurately locate and press a key. For many 
disabled users this is not possible and, furthermore, EyePoint also fails to provide a completely hands-free 
environment for users with busy-hands tasks. 
A system which does provide completely hands-free interaction and which has proven to be useful for disabled 
users is the EagleEyes system of Gips and Olivieri (1996) (Figure 2.7). This system relies on the measurement of 
the electro-oculographic potential (EOG), and requires that five electrodes be positioned on the head of the 
user. The user can then control the cursor through movement of the eyes while keeping the head stationary or 
through movement of the head while keeping the position of the eyes relatively stationary or alternatively, a 
combination of both. A wide range of applications can be controlled through EagleEyes such as educational 
and entertainment software, messages can be spelled out and users can also navigate on the Internet. The 
EagleEyes system runs as a background application and captures eye gaze coordinates every 1/60th of a 
second. These coordinates are then treated as though they were mouse coordinates and not gaze coordinates. 
Optionally, dwell time can be used to simulate a mouse click. 
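Treating gaze samples as pointer positions and deriving a click from dwell time can be sketched as follows. This is not the EagleEyes implementation: only the 60 samples per second come from the description above, while the function name, the dwell duration and the tolerance radius are assumptions chosen for illustration.

```python
import math

SAMPLE_RATE_HZ = 60  # one gaze sample every 1/60th of a second, as described above


def dwell_clicks(samples, dwell_s=0.5, radius_px=30.0):
    """Yield (x, y) click positions from a stream of gaze samples.

    samples: iterable of (x, y) pairs, one per sampling interval.
    A click fires once when the gaze has stayed within `radius_px` of the
    point where the current dwell started for at least `dwell_s` seconds.
    The default dwell and radius are illustrative assumptions.
    """
    needed = math.ceil(dwell_s * SAMPLE_RATE_HZ)
    anchor = None   # position at which the current dwell started
    count = 0       # consecutive samples that stayed near the anchor
    fired = False   # whether this dwell has already produced a click
    for x, y in samples:
        if anchor is not None and math.dist(anchor, (x, y)) <= radius_px:
            count += 1
        else:
            anchor, count, fired = (x, y), 1, False
        if count >= needed and not fired:
            fired = True
            yield anchor


# A glance of 29 samples is too short to click; the 40-sample fixation clicks once.
stream = [(100, 100)] * 29 + [(400, 400)] * 40
print(list(dwell_clicks(stream)))
```

The tolerance radius matters in practice because the small targets found in standard windowed applications leave little room for jitter in the gaze signal.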
 
Figure 2.7: EagleEyes application in use 
Source: Gips and Olivieri (1996) 
The EagleEyes system was successfully developed for users with severe disabilities and can take anywhere 
from 15 minutes to a few months to master (Gips & Olivieri, 1996). One of the greatest advantages of 
EagleEyes is that it is actually a background function which is application independent which means it has the 
potential to be used in conjunction with any application. The small targets which are present in modern day 
applications will still pose difficulties for accurate selection which may limit its use in standard windowed 
applications. The more invasive use of electro-oculographic potential can also be a drawback as users may not 
relish the idea of having electrodes positioned on their face which is unnatural and not necessarily suited to 
use in all environments. The advancement of technology has allowed for eye-tracking systems which are far 
less invasive than the EagleEyes system and which therefore provide much potential as an alternative means of 
input. The possibility of adapting EagleEyes for use with other eye-tracking technology is yet to be explored. 
GazeSpace also provides a hands-free gaze based selection system to browse content spaces, such as blogs, 
news pages, video and image clips (Laqua, Bandara & Sasse, 2007). GazeSpace utilises the centre of the display 
as a main area in which information is displayed. Smaller contextual navigational elements surround this main 
area and users can navigate to one of these using eye gaze. Upon selecting one of the navigational elements it 
will be enlarged and moved into the centre area. Three different selection means were provided, including a 
type of accumulative dwell time where users were allowed to look at another element and then return to the 
original element without any of the accumulated time being lost. User preference was highest for this 
accumulated threshold since it facilitated faster selection. Overall user reaction to GazeSpace was very 
positive. A shortcoming which can be identified in this study is that the accumulated dwell time may eventually 
result in the Midas touch problem unless a mechanism is provided to cancel all accumulated dwell time should 
the user deem it necessary. 
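An accumulative dwell timer with the kind of cancel mechanism suggested above can be sketched as follows. The class name, the threshold value and the element names are hypothetical and do not reflect how GazeSpace itself is implemented.

```python
class AccumulatedDwell:
    """Per-element dwell timers that survive glances at other elements."""

    def __init__(self, threshold_s=1.0):
        self.threshold_s = threshold_s  # illustrative threshold, an assumption
        self.accumulated = {}

    def update(self, element, dt_s):
        """Credit `dt_s` seconds of gaze to `element`.

        Returns the element once its accumulated time reaches the threshold
        (its timer is then cleared), otherwise None. Pass element=None while
        the gaze rests on no element.
        """
        if element is None:
            return None
        total = self.accumulated.get(element, 0.0) + dt_s
        if total >= self.threshold_s:
            self.accumulated.pop(element, None)
            return element
        self.accumulated[element] = total
        return None

    def cancel(self):
        """Discard all accumulated time, for example on an explicit user request."""
        self.accumulated.clear()


dwell = AccumulatedDwell(threshold_s=1.0)
first = dwell.update("news", 0.6)    # not yet selected
second = dwell.update("blog", 0.3)   # glance elsewhere; "news" keeps its 0.6 s
third = dwell.update("news", 0.5)    # 1.1 s in total, so "news" is selected
print(first, second, third)
```

Without the `cancel` method, time credited during casual glances would build up indefinitely and eventually trigger an unintended selection, which is the Midas touch risk noted above.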
EyeDraw (Hornof et al., 2004) is another example of a gaze-sensitive application. EyeDraw is a drawing 
application where the cursor is controlled by the eye movements of the user who can toggle between looking 
at the drawing or drawing on the canvas by enabling and disabling a gaze sensitive button. Dwell time was 
used as an activation mechanism and once the drawing command had been executed, auditory feedback was 
given. Ten fully-able children were used to test EyeDraw and the quality of the drawings produced indicated 
that EyeDraw can be used to draw pictures with eye gaze. 
Other applications in which eye-tracking has been used to enhance human-computer interaction are 
RealTourist (Qvarfordt, Beymer & Zhai, 2005), virtual reality (Jacob, 1993b), a number of gaming genres (cf. 
Jönsson, 2005; Špakov, 2005) and EyePliances (Shell et al., 2003b). 
 
2.8.6 Market trends of eye-tracking 
“Eye tracking technologies could transform the lives of tens of thousands of people ... The most extreme 
example of how this technology is used is its ability to give voice to people who are 'locked-in', people 
who can only move their eyes and only communicate with their gaze” (Kari-Jouko Räihä as quoted in 
COGAIN, 2006, p. 1). 
The application of eye-tracking in the field of human-computer interaction started in the 1980s after a surge in 
the popularity of eye-tracking in the 1970s (Jacob & Karn, 2003). Each resurgence in eye-tracking led to more 
advances in both the application thereof and the hardware and technology needed for successful use of the 
eye-tracker (Jacob & Karn, 2003). Consequently, eye-tracking has been viewed as a promising technology for 
decades without ever quite living up to the expectations, similar to speech recognition. Jacob and Karn (2003) 
offer the assurance that for a technology to be seen as promising for so long it must be very promising 
otherwise it would have long since been abandoned; they caution however that there must be something 
holding back eye-tracking from reaching its full potential. Possible reasons for this could be technical problems, 
the labour-intensive data extraction and the difficulties experienced with data interpretation (Jacob & Karn, 
2003). Resolutions to these problems have slowly been forthcoming which could possibly lead to the eventual 
adoption of eye-tracking as a mainstream usability evaluation tool and/or input device. 
Currently available eye-tracking technologies are expensive and beyond the financial reach of most users. This 
poses the greatest obstacle in preventing widespread adoption of eye-tracking technology (Kumar, 2006). Even 
amongst disabled users, perhaps the group which stands to benefit the most from such technology, the use is 
limited to the select few who can afford the acquisition of the expensive equipment (Kumar, 2006). The 
development and availability of an application, dubbed a “killer application”, which uses eye-tracking 
technology will increase demand for eye-trackers and cause a substantial reduction in prices (Kumar, 2006). 
The successful incorporation of eye gaze into a mainstream application could give the acceptance of such 
technology the boost that it needs. 
In September 2004, the Communication by Gaze Interaction (COGAIN) network undertook an ambitious project to 
establish standard control software that need not be tied to the proprietary software of eye-tracking vendors 
(COGAIN, 2006). In this way, COGAIN aims to make development of eye gaze software accessible to a wider 
audience (COGAIN, 2006). Additionally, COGAIN aims to develop a more affordable eye-tracker solution using 
a standard webcam and ambient light (COGAIN, 2006). Should this be achieved, the software capabilities should 
be ready to handle the deluge of demand for software to use with newly available and affordable hardware. 
This places the onus on the HCI community to ensure that these hardware advances will not be in vain but that 
the relevant software will be in place to exploit these technologies. Therefore, studies such as the current one 
are pertinent to the advancement of the software to ensure that they progress to meet the capabilities and 
availability of the hardware. 
Apart from such aforementioned initiatives, cost effective means of providing such technology have been 
explored with great success. It might be expected that cheaper eye gaze technology would deliver accuracy 
and performance inferior to those of state-of-the-art systems, which employ high-precision cameras, eye 
recognition firmware and video processing systems. If that were the case, it would hardly be worth the effort 
to produce lower cost hardware, since the resulting system would be inferior in terms of accuracy, speed and 
other performance indicators. However, 
contrary to these assumptions, it has been found that through use of a standard PC and a fairly inexpensive 
widely available webcam, it is possible to achieve acceptable selection performance results (Corno, Farinetti & 
Signorile, 2002). A reliable and inexpensive eye-tracker which is capable of using infrared light to track eye 
gaze in real time has already been developed (Haro, Essa & Flickner, 2000). GazeTalk is a gaze-based typing 
communication tool designed specifically for people with Amyotrophic Lateral Sclerosis (ALS) who are unable 
to communicate in any way but with their eyes (Hansen et al., 2003). The system is capable of functioning on a 
standard computer using commercially available digital camera technology (Hansen et al., 2003) which makes 
the system very attractive in terms of affordability. Furthermore, a low-cost gaze-enabled user interface was 
developed which achieved rates of up to 99% accuracy in locating the eyes correctly (Su et al., 2005).  
The slow acceptance rate should not be disheartening as there is often a fairly long time span between 
invention and widespread use of devices. For example, ten years after the mouse was invented, it could only 
be found in a handful of research laboratories and it was only after twenty years that it was found in arenas 
other than research laboratories (Jacob & Karn, 2003). The acceptance of eye-tracking recently received a 
boost in the form of the world’s first eye-controlled laptop which was released on 1 March 2011 (Tobii, 2011). 
In collaboration with Lenovo, Tobii has provided control for icon selection, zooming and centring of the 
working area. The screen is also capable of auto-dimming and brightening depending on whether the user’s 
eyes are recognised (Tobii, 2011). Although the technology is still expensive at this stage, the fact that a fully-
functioning prototype could be developed is most encouraging for eye-tracking technology. 
In conclusion, the cost associated with such technologies should not be disheartening to the research 
community who must continually strive to find ways in which the technology can be used. This could result in a 
decrease in the cost of the equipment, but even in the event that this does not occur, cheaper highly accurate 
eye-tracking devices may be available to fill the void. The current study therefore aims at the advancement of 
the software applications of the technology, specifically in a multimodal capacity. 
 
2.9 Multimodal interfaces 
Previous sections discussed the shortcomings and limitations of using eye gaze and speech as isolated 
interaction techniques. This study proposes to combine these two in a multimodal capacity to determine 
whether the shortcomings of one can be compensated for through the use of the other. Previously, fears were 
expressed that using two error-prone interaction techniques together in a multimodal interface would result in 
an interface that compounded the errors but it has since been proven that the multimodal interface is in fact 
more robust. Some examples will be discussed in a subsequent section. Moreover, suggestions have been 
made to abandon the possibility of eye gaze as the only input device, but rather to use it as one modality in a 
multimodal interface (Hyrskykari, 1997).  
Speech and gesture seem to be a very popular choice for multimodal interfaces as is evidenced in the number 
of applications thereof (cf. Bolt, 1980; Hauptmann, 1989; Latoschik, Fröhlich, Jung & Wachsmuth, 1998; 
Oviatt et al., 2000). This is perhaps because humans tend to talk with their body and in human-human 
45 
 
Chapter 2 Theoretical Background 
communication it is not only the speech content which plays a role in the understanding but also body 
language. In order to truly emulate human-human communication it may be necessary to interpret a full set of 
gestures, including hands, head and eye gaze, with speech. 
Eye gaze systems are an attractive alternative to direct manipulation with a mouse since users naturally look at 
an object of interest and they are accustomed to completing other tasks while looking (Sibert & Jacob, 2000). 
A multimodal interface which uses eye gaze as one of the modalities will enrich user experience as it will serve 
to reduce ambiguous commands (Sibert & Jacob, 2000). Although it would seem a natural assumption that a 
computer user would look at a target before attempting to click on it, Smith, Ho, Ark and Zhai (2000) found 
variability in hand-eye coordination during target selection using various devices, to the extent that the same 
individual exhibited different tendencies even when using the same input device. Coordination techniques 
varied between eye gaze preceding the cursor, the cursor preceding eye gaze and the eye gaze switching 
between the cursor and the target until the target was reached (Smith et al., 2000). Therefore, it is imperative 
that researchers first determine the capacity in which eye gaze can be interpreted within a multimodal 
interface. 
 
2.9.1 Classification of multimodal interfaces 
Coutaz and Caelen (1991) offer a taxonomy for multimodal interfaces based on the number of modalities 
which can be used simultaneously. An exclusive multimodal user interface offers a choice of modalities 
although input is obtained from one modality only. In contrast, a synergic multimodal user interface also offers 
various modalities but input is built from multiple modalities being used in unison. For example, when speech 
and gestures are used, a verbal command such as “Put that there” can be interpreted through simultaneous 
interpretation of gestures. 
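The synergic case can be illustrated by fusing a time-stamped speech stream with a time-stamped pointing stream. This is a minimal sketch under stated assumptions: the event formats, the list of deictic words and the nearest-in-time matching rule are introduced here for illustration and are not taken from Coutaz and Caelen (1991).

```python
DEICTIC_WORDS = {"that", "there", "this", "here"}  # illustrative list, an assumption


def fuse(words, pointing_events):
    """Replace each deictic word with the target pointed at closest in time.

    words: list of (timestamp_s, word) from a speech recogniser.
    pointing_events: list of (timestamp_s, target) from a gesture or gaze tracker.
    """
    resolved = []
    for t_word, word in words:
        if word.lower() in DEICTIC_WORDS and pointing_events:
            _, target = min(pointing_events, key=lambda e: abs(e[0] - t_word))
            resolved.append(target)
        else:
            resolved.append(word)
    return resolved


speech = [(0.0, "Put"), (0.4, "that"), (0.9, "there")]
pointing = [(0.45, "circle"), (1.0, (120, 80))]
print(fuse(speech, pointing))
```

In an exclusive interface, by contrast, each command would be taken from a single stream and no such fusion step would be needed.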
Multimodal interfaces can also be tactile, auditory or visual. Tactile interfaces require physical contact, 
auditory interfaces use some form of speech or sound detection while visual interfaces detect human movement in some 
way, for example eye gaze.  
The current study will compare a synergic visual and auditory multimodal interface with a standard tactile 
interface where the user types using a keyboard and uses a mouse device for pointing and selecting.  Options 
for an exclusively visual interface will also be available. 
 
2.9.2 Implementation of multimodal interfaces 
Bernhaupt et al. (2007) conducted a study with two mice and speech recognition and found that such an 
interface was quickly adopted as a natural means of interaction for a satellite monitoring application. 
However, during the course of two-handed interaction, participants often neglected to make use of speech 
interaction (Bernhaupt et al., 2007). Nevertheless, task solution was more efficient when using two mice than 
with only one and also resulted in lower cognitive load (measured as the number of fixations during task 
completion) on the users. However, user perception of the cognitive load experienced was not measured. A 
possible reason for not using speech when engaged in multi-mouse manipulation could be linked to the strain 
the user felt while concentrating on moving two mice. Overall, the cognitive load could be lower when using 
two mice but expecting users to use two mice and speech commands could be unrealistic. The study could 
have been enhanced by eliciting user satisfaction with the system and asking users why they neglected to use 
speech commands under certain conditions. The introduction of an additional modality could also be delayed 
to the stage when the user has completely mastered working with two mice. 
A vision and gesture-based application has been offered as a possible multimodal interface as body posture, in 
general, and pointing are natural modalities (Wachs et al., 2011). However, these mean nothing if the system 
is not aware of what is being pointed at, hence the need for combination with vision. This universal concept 
should be intuitive and hand gestures have been suggested as a universal language (Wachs et al., 2011). 
However, cultural references and context may play a huge role in the definition of a particular hand gesture 
and this modality may not be as universal or intuitive as suspected. 
QuickSet is an application for use by the military for force “laydown” (Cohen et al., 1998). QuickSet uses a 
multimodal interface consisting of a pen and voice. Objects can be placed on a map by simultaneously drawing 
and speaking. Users of QuickSet were able to achieve substantially higher speeds than when using a traditional 
GUI interface (Cohen et al., 1998). Users also indicated a preference for the multimodal interface over direct 
manipulation (Cohen et al., 1998). 
Intellectual Computer Assistant for disabled operators (ICANDO) uses head movements and speech commands 
while Multimodal Oral With Gesture Large display Interface (MOWGLI) uses gesture recognition and speech 
recognition to create a collaborative environment in which two users can work together always being aware of 
the other’s actions (Karpov, Carbini, Ronzhin & Viallet, 2008). Using Fitts’ experiments it was found that 
MOWGLI performed better as a pointing device than ICANDO.  
Speech and gestures have also been used for gaming through the development of a gaze-aware table (Tse, 
Greenberg, Shen & Forlines, 2006). 
 
2.9.3 Eye gaze and speech multimodal interfaces 
Previous sections have discussed the use of eye gaze as well as speech recognition as an input modality. 
Insofar as can be ascertained these particular modalities are not often used in combination for multimodal text 
input. When used in isolation, these and other alternative modalities such as gestures are often ambiguous but 
when appropriately used in combination, they could result in effective interaction methods (Oviatt, 1999). 
The goal of speech and vision multimodal interfaces is to emulate the ease and robustness of human 
communication through the integration of automatic speech recognition (ASR) and the nonverbal 
communication afforded by the use of eye gaze (Pireddu, 2007). In particular, the foremost aim of multimodal 
interaction is to integrate interaction methods to allow the advantages of one to supersede the drawbacks of 
another. Amongst the variety of available alternative input modalities, the combination of speech recognition 
and eye gaze has not gained much popularity, yet when eye gaze is used for locating objects and speech for 
issuing commands a fully functional system is entirely feasible (Miniotas et al., 2006). Given the inherent 
problems associated with target selection via eye gaze (section 2.8.4), it seems plausible that an additional 
modality might make selection easier and more feasible. For example, the Midas Touch problem will be 
minimised as two inputs will be required before the application will respond. Furthermore, the ambiguity 
caused by inaccurate eye-tracking could be negated if an additional modality was available to infer user 
intention. To date, though, there have been very few empirical studies conducted to explore this 
phenomenon. 
Gaze as an input medium has the advantage of being a reliable indicator of the current focus of attention and 
since it is a natural input medium it does not require any hand-eye coordination to be learnt (Kaur et al., 2003). 
The eyes are expressive during conversation, which is often punctuated with gestures as well (Kaur et al., 
2003). Therefore, it would seem natural to combine eye gaze and speech in a multimodal environment. 
However, when doing so, it is imperative that eye gaze and speech be synchronised so that intention can be 
inferred correctly (Kaur et al., 2003).  As recently as 2009, it was said that an interface capable of improved 
human-computer communication through the use of eye gaze and speech is still a long way from being 
possible and requires further research to investigate the possibilities (Drewes & Schmidt, 2009). 
Zhang, Imamiya, Go and Mao (2004) confirmed that a multimodal interface using eye gaze and speech yielded 
better performance than a speech-only interface. Their application responded to speech commands and used 
eye gaze to resolve ambiguities or to verify what the intended target was, based on its proximity to the eye 
gaze of the user. 
When implementing a system using a combination of eye-tracking and speech recognition, Castellina, Corno 
and Pellegrino (2008) advocate that there are three aspects which play a simultaneous role at any given 
moment in the users’ interaction with the system, namely the objects, the context and the commands. The 
objects are the widgets that are available on the screen, for example, icons, buttons or menus. The context is 
the area which is identified by the eye-tracker as where the eye gaze of the user is focused. This context is also 
referred to as the gaze window. The commands are a list of possible objects or action names within the gaze 
window. The use of a gaze window, which may contain one or more objects, eliminates the error created by 
the detection of the direction of the user’s gaze. The user will utter a command name after gazing at a certain 
area on the screen and the application will then match the utterance to a list of commands/objects contained 
within the gaze window and generated as a VoiceXML grammar. The ambiguity of the uttered command will 
thus be eliminated. Tests indicated that the combination of the two modalities succeeded in overcoming the 
inherent ambiguities present in each (Castellina et al., 2008). 
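The gaze window idea described above can be illustrated with a minimal sketch. It is not the implementation of Castellina et al. (2008), which generates a VoiceXML grammar; it only shows the principle of restricting the set of recognisable names to the objects near the gaze point. The names Widget, gaze_window_grammar and resolve_command, the square window shape and the default half_size of 50 pixels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Widget:
    name: str   # spoken name of the object, e.g. "save"
    x: float    # centre of the widget in screen pixels
    y: float


def gaze_window_grammar(widgets, gaze_x, gaze_y, half_size):
    """Return the names of all widgets that fall inside a square gaze window."""
    return [
        w.name
        for w in widgets
        if abs(w.x - gaze_x) <= half_size and abs(w.y - gaze_y) <= half_size
    ]


def resolve_command(utterance, widgets, gaze_x, gaze_y, half_size=50):
    """Match an utterance against the grammar built from the gaze window.

    Returns the matched name, or None if the spoken word does not name
    an object inside the window.
    """
    grammar = gaze_window_grammar(widgets, gaze_x, gaze_y, half_size)
    spoken = utterance.strip().lower()
    return spoken if spoken in grammar else None


# Hypothetical toolbar: two neighbouring buttons and one far away.
widgets = [
    Widget("open", 100, 20),
    Widget("save", 140, 20),
    Widget("print", 600, 20),
]
```

With the gaze at (120, 25) both "open" and "save" fall inside the window, so an imprecise gaze estimate is tolerated and the utterance decides between them, while "print" is rejected because it lies outside the window.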
The nature of the speech commands and their use in the current study precludes the ability to generate a 
grammar based on the gaze window. During normal text input in a word processor, the user may often want to 
change the formatting or perform other related editing tasks. While it may be possible to change the grammar 
based on whether the eye gaze is within the bounds of the onscreen keyboard or not, it was decided that in 
order to increase the naturalness of the application and reduce the number of eye movements required, these 
commands could be issued regardless of where the eye gaze was at that given moment. Therefore, while the 
context will be established, the grammar will not be dependent on the gaze window but typing commands will 
be processed dependent on the gaze position. 
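The design decision described above, in which editing commands are accepted regardless of gaze position while typing depends on it, can be sketched as a simple dispatcher. This is an illustrative assumption and not the code of the developed application; the command names in FORMATTING_COMMANDS and the word used for TYPE_COMMAND are hypothetical.

```python
FORMATTING_COMMANDS = {"bold", "italic", "underline"}  # hypothetical command names
TYPE_COMMAND = "type"                                  # hypothetical typing command


def dispatch(utterance, gaze_key):
    """Decide how to handle a recognised utterance.

    gaze_key is the onscreen key currently under the gaze, or None when
    the user is looking elsewhere (for example, at the document).
    """
    spoken = utterance.strip().lower()
    if spoken in FORMATTING_COMMANDS:
        # Editing commands are accepted wherever the user is looking.
        return ("format", spoken)
    if spoken == TYPE_COMMAND and gaze_key is not None:
        # Typing needs a key under the gaze to know what to enter.
        return ("type", gaze_key)
    return None
```

The grammar itself is fixed; only the handling of the typing command consults the gaze position.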
 
2.9.3.1  Acquisition and spacing of targets 
The Portable Interactive Command Console (PICC) is used for crisis management, manpower and equipment 
deployment in the field and an experimental interface included the use of gaze and speech to move objects 
around the interface (Kaur et al., 2003). Results indicated that the correct fixation to use for identification of 
the target object was the one which was acquired, on average, 630 milliseconds before the verbal command 
was issued.  The interface of the current study will interpret the target as the one which has focus when the 
command is processed. Results will determine whether this intended method is sufficient for this type of 
multimodal interface. 
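The difference between the two target identification strategies can be made concrete with a small sketch. It assumes that fixations are available as a time-ordered list and interprets the finding of Kaur et al. (2003) as looking up the fixation that was active 630 ms before the command; this is only one possible reading of their result. The names Fixation, target_at and target_with_lead are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    start_ms: int   # time at which the fixation began
    target: str     # object under the gaze during the fixation


def target_at(fixations, time_ms):
    """Return the target of the latest fixation that started at or before time_ms."""
    candidates = [f for f in fixations if f.start_ms <= time_ms]
    if not candidates:
        return None
    return max(candidates, key=lambda f: f.start_ms).target


def target_with_lead(fixations, command_ms, lead_ms=630):
    """Look back lead_ms before the command (assumed reading of Kaur et al., 2003)."""
    return target_at(fixations, command_ms - lead_ms)


# Hypothetical gaze history: the user looked at A, then B, then C.
fixations = [Fixation(0, "A"), Fixation(900, "B"), Fixation(1400, "C")]
```

For a command processed at 1500 ms, the focus-at-processing strategy of the current study yields "C", whereas the look-back strategy yields "A". The two strategies can therefore disagree whenever the user's gaze has already moved on by the time the command is processed.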
To investigate the feasibility of small, closely spaced targets using speech and eye gaze combined, Miniotas et 
al. (2006) required participants to select a single button in a 5×5 matrix of small closely spaced buttons. All 
squares encompassed within the region of interest (ROI), as detected by the currently detected eye gaze, were 
highlighted by outlining each in a different colour (Figure 2.8). In a mixed modality trial, a participant could 
verbalise the colour of the desired square aloud to select it regardless of whether that square was the selected 
square or not. It was determined that the ideal setup is to have icons sized 30×30 pixels (≈1°) with a 10 pixel 
(≈0.3°) space between them (Miniotas et al., 2006). The study determined that there is high accuracy of target 
selection to such an extent that user performance approaches that of manual pointing.  
 
 
Figure 2.8: Matrix with ROI squares each outlined in a different colour 
Source: Miniotas et al. (2006) 
The use of speech to select a target could possibly be faster than the dwell time of 1500 ms which was used 
and it is suspected that accuracy will also be much higher since the correct button can be selected without the 
eye gaze having to be positioned on it. Therefore, in this instance it would appear that the use of the 
multimodal interface would be more usable than just eye gaze. The colours used must be easy to identify and 
vocalise and the utterances must differ enough to avoid confusion and facilitate adequate recognition. For 
such an application a multimodal interface seems to be a better solution than one using eye gaze in isolation. 
It is suspected that users would also have preferred to use speech and not only to resolve ambiguities. In 
terms of motivation for the current study, the study by Miniotas et al. (2006) firstly demonstrates that, despite 
its relative unpopularity as a multimodal interaction means, speech and gaze input can be successfully combined 
to create an environment which can be manipulated as accurately as with manual pointing and selection. Secondly, 
the optimal target size identified by that study will be used as a basis for the targets in the current study. 
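The colour-based disambiguation used by Miniotas et al. (2006) can be sketched as follows. The palette, the button identifiers and the function names are assumptions for illustration only; the original study does not specify them in the text above.

```python
COLOURS = ["red", "green", "blue", "yellow"]  # hypothetical palette


def label_roi(buttons_in_roi):
    """Assign each button in the region of interest a distinct outline colour."""
    if len(buttons_in_roi) > len(COLOURS):
        raise ValueError("more buttons in the ROI than distinguishable colours")
    return dict(zip(COLOURS, buttons_in_roi))


def select_by_colour(labels, spoken_colour):
    """Return the button whose outline colour was spoken, or None if unrecognised."""
    return labels.get(spoken_colour.strip().lower())


# Hypothetical ROI covering four neighbouring buttons of the 5x5 matrix.
labels = label_roi(["B7", "B8", "B12", "B13"])
```

Because the spoken vocabulary is limited to the few colours in the palette, the utterances can be chosen to be acoustically distinct, which supports the recognition requirement noted above.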
 
2.9.3.2 Applications 
An implementation of a multimodal interface by Zhang et al. (2004) used eye gaze and speech and required 
users to select differently sized, shaped and coloured figures. Speech commands based on colour, colour and 
shape as well as colour, shape and size were available in order to select an appropriate object. The position of 
the gaze when commands were issued was then used to determine which object had to be selected in 
combination with the speech command received. The use of both eye gaze and speech was found to be more 
robust than using only eye gaze or speech.  
The superior usability of eye gaze and speech in this instance bodes well for the acceptance of these 
modalities. However, care must be taken that the required grammar is not so complicated that it imposes 
additional memorisation on the user. The grammars used in this instance may be too limited to include in a 
fully functioning application. Therefore, the limited number of colours which highlight objects within the 
current region of interest of the user seems to be a better solution when viewed in the context of a large scale 
application. However, both of these presume that users have a certain command of the language and are able 
to verbalise a wide-ranging grammar, which might not be the case. The current study will make no such 
assumption when it comes to using eye gaze and speech for pointing purposes. 
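The selection scheme attributed to Zhang et al. (2004), in which the speech command narrows the candidates and the gaze position resolves any remaining ambiguity, can be illustrated with a short sketch. The data structure, the function name and the use of Euclidean distance as the proximity measure are assumptions; size is omitted for brevity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Figure:
    colour: str
    shape: str
    x: float   # centre of the figure in screen pixels
    y: float


def select_figure(figures, gaze, colour, shape=None):
    """Filter figures by the spoken attributes, then pick the one nearest the gaze."""
    matches = [
        f for f in figures
        if f.colour == colour and (shape is None or f.shape == shape)
    ]
    if not matches:
        return None
    return min(matches, key=lambda f: math.hypot(f.x - gaze[0], f.y - gaze[1]))


# Hypothetical scene with two red figures and one blue figure.
figures = [
    Figure("red", "circle", 100, 100),
    Figure("red", "square", 400, 100),
    Figure("blue", "circle", 110, 100),
]
```

Saying only "red" while looking near (390, 110) selects the red square, since gaze proximity breaks the tie between the two red figures. Adding the shape to the command ("red circle") makes the utterance unambiguous on its own, at the cost of a larger grammar.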
EyeTalk is a voice and eye gaze integrated application which allows a user to gaze at an object and issue a 
verbal command which is then captured and merged into a single message and passed to the current 
application as a mouse click or keyboard event (Hatfield & Jenkins, 1997). Users are able to fixate on an object, 
which causes the mouse cursor to move to that position, and then issue a command to execute a mouse click. 
Initial results with EyeTalk showed positive feedback and indicated that users were able to operate the system 
with high efficiency after just a few moments of getting accustomed to the system (Hatfield & Jenkins, 1997). 
Although EyeTalk is application independent and can potentially be used with a multitude of applications it 
was only tested with aviation displays and may not be suitable for use with standard windowed applications. 
Kammerer et al. (2008) tested menu selection with eye gaze only and with eye gaze and speech combined. 
Three different menu designs were used, namely, a linear menu, a full-circle menu and a semi-circle menu. Eye 
gaze was used to establish the menu item that was to be selected and then either gaze or speech was used for 
selection purposes. Results indicated that accuracy with the linear menu was significantly lower than with the 
other two menus but that the input device did not affect the accuracy with which a menu item could be 
selected. The semi-circle menu yielded the fastest selection time of the three different kinds of menus. In 
terms of the interaction technique, the eye gaze and speech interaction techniques had a significantly longer 
selection time than eye gaze only (Kammerer et al., 2008). These findings are contrary to what would be 
expected for selection time but perhaps not for accuracy. Since the menu items were identical for both the 
interaction techniques, the accuracy with which the correct menu item could be acquired when using only eye 
gaze and when using eye gaze and speech should be the same. The time to acquisition can be assumed to be 
the same for the different interaction techniques.  Since selection with the dwell time is faster, it implies that 
when using speech, the time required to issue the command, synthesise it and react to it, is significantly longer 
than 750 ms.  
EyeCook is an attentive multimodal cookbook using eye gaze and speech which changes the display based on 
whether the user’s attention is focused on the book or not (Shell, Bradbury, Knowles, Dickie & Vertegaal, 
2003a). If the user is looking at the cookbook then the recipe is shown on one page, otherwise the recipe is 
broken into multiple cards with enlarged text. Speech can then be used to issue verbal commands which are 
context sensitive based on the position of the eye gaze at the moment when the command is issued (Shell et 
al., 2003a). 
Eye gaze and speech have also been effectively combined to design a user interface with a 360° panoramic 
view (Stiefelhagen & Yang, 1997), for implicit and explicit command activation in aircraft (Schnell, 2000), 
and for an attentive television and AuraLamp, an eye gaze and speech enhanced lava lamp (Shell et al., 2003b). 
These studies highlight the potential uses of eye gaze and speech for a multimodal interface and encourage 
researchers to continue to find other uses for these modalities. Specifically, eye gaze and speech will be used 
for text entry and the next section will discuss some of the text entry applications which employ the use of 
these modalities. 
 
2.9.4 Text and data entry using eye gaze and speech 
In terms of data entry, eye gaze and speech recognition have been implemented, with great success, to 
complete a television licence application form in the United Kingdom (Tan, Sherkat & Allen, 2003a). The edit 
box which the user is looking at receives focus and then dictation can be used to complete the forms. This 
method was compared to the mouse and keyboard, handwriting and speech only. Even though eye gaze and 
speech was neither the fastest nor the most accurate, it was the most preferred method of data entry (Tan et 
al., 2003a). This could be attributed to the naturalness of completing a form in this manner. Another possibility 
which could be investigated is the use of eye gaze to set the focus and the keyboard to enter data. 
Other means of data entry are the RESER and SPELLER systems (Tan, Sherkat & Allen, 2003b). The keyboards 
used in these systems are cluster keyboards and users are required to look at the relevant key on the keyboard 
and then speak the letter that they wish to type. The RESER system will attempt to recognise the word and 
offer a suggestion once it can recognise the word that is being typed. The user must then give confirmation as 
to whether or not that was the intended word. The SPELLER system, on the other hand, requires users to spell 
out the entire word. Visual feedback to indicate focus is through highlighting the button on the keyboard. For 
text entry, users preferred the mouse and the keyboard while speech and eye gaze was the preferred means 
of data recovery. 
The fact that a button must receive focus significantly reduces the size of the potential vocabulary which must 
be recognised. It is surmised that the accuracy rate of such a method would be much higher than if a 
full-length vocabulary was present at all times. In addition, the use of fewer buttons, each carrying more letters, 
reduces the amount of screen real estate which is required by the onscreen keyboard, thus lessening one of the 
associated disadvantages of eye gaze as an input device. The use of a grammar comprising alphabetic 
characters and the fact that a visual cue is also available reduces the amount of learning and memorisation 
that is required by the user. The use of multiple modalities should also yield higher text entry accuracy 
than speech recognition alone. A combination of the colour scheme used by Miniotas et al. (2006) for 
non-alphabetic characters with this text entry method could provide an all-encompassing interface for eye 
gaze and speech. Of course, this assumes that the user has a wide-ranging vocabulary and is capable of 
producing all the required utterances, which is not always the case. 
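The vocabulary reduction afforded by a cluster keyboard can be shown with a minimal sketch. The cluster layout below is hypothetical and is not the layout used in the RESER and SPELLER systems; the point is only that the recogniser needs to distinguish between the few letters on the focused key rather than the whole alphabet.

```python
# Hypothetical cluster layout: each onscreen key carries several letters.
CLUSTERS = {
    "key1": "abcd",
    "key2": "efgh",
    "key3": "ijkl",
}


def active_vocabulary(focused_key):
    """Letters the recogniser has to distinguish while this key has focus."""
    return list(CLUSTERS[focused_key])


def type_letter(focused_key, spoken_letter):
    """Accept the spoken letter only if it appears on the focused key."""
    spoken = spoken_letter.strip().lower()
    return spoken if spoken in CLUSTERS[focused_key] else None
```

In this layout the active vocabulary shrinks from 26 letters to four whenever a key has focus, and a letter that is misrecognised as one from another cluster is rejected rather than typed.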
Dasher is a text entry interface which uses continuous pointing gestures to facilitate text entry (Ward, 
Blackwell & MacKay, 2000). All letters start on the right hand side of the screen and a user must point at the 
desired character to cause the area around that character to grow larger. The character then also starts 
moving towards the left side of the screen and once it crosses the centre of the screen it is accepted for text 
entry. The size of the letters is also adjusted according to the probability of the letter being selected next in 
order to speed up typing. When using a mouse as the pointing device, users were able to achieve typing 
speeds of 34 words per minute compared to traditional keyboard input speeds of 40-60 words per minute. 
Dasher has since been modified to use eye gaze as a pointing device for a hands free environment (Tuisku, 
Majaranta, Isokoski & Räihä, 2008). During the first session with the modified version, typing speed was 2.5 
WPM while after the tenth session of working with Dasher, users were able to type an average of 17.3 WPM 
which indicates that learning is required to achieve acceptable speeds with the application. The result is 
nevertheless a promising one as it offers an intuitive hands-free means of text entry for a variety of users. 
Comparison with a mouse showed significantly slower entry rates with eye gaze than with the mouse but no 
significant difference in error rate was detected between the two pointing devices (Tuisku et al., 2008). Only 
one session was completed with the mouse where participants achieved an entry rate of 20.69 WPM which 
was only slightly higher than that of eye gaze after ten sessions. However, it cannot be said that eye gaze and 
mouse input are comparable in this instance since the previous Dasher study showed that speeds of 34 WPM 
were possible after extended practice with the mouse. It can, however, be concluded that eye gaze may 
require more practice than the mouse but whether the speeds will be comparable once both modalities reach 
a plateau, remains to be seen. 
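The probability-based sizing of letters in Dasher can be illustrated with a simplified sketch in which the vertical space of the screen is divided among candidate letters in proportion to their likelihood. The weights below are invented for illustration and the function name is hypothetical; Dasher itself derives the probabilities from a language model, which is not reproduced here.

```python
def box_heights(weights, screen_height):
    """Divide the screen height among letters in proportion to their weights."""
    total = sum(weights.values())
    return {letter: screen_height * w / total for letter, w in weights.items()}


# Hypothetical relative likelihoods of the next letter.
weights = {"e": 5, "t": 3, "q": 2}
heights = box_heights(weights, 1000)
```

A more probable letter receives a larger target and is therefore easier and quicker to point at, which is the mechanism by which the sizing is intended to speed up typing.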
Dasher has been proven to respond well to control via a brain-computer interface (Felton, Lewis, Wills, Radwin 
& Williams, 2007) and the aptly named Speech Dasher extends the capabilities of Dasher even further by 
including speech recognition as well (Vertanen & MacKay, 2010). Speech Dasher uses the same selection 
technique as the original Dasher but allows the user to zoom through entire words. The word set is obtained 
through speech recognition where the user speaks the text they would like to enter. With a recognition error 
rate of 22%, users were able to achieve typing speeds of 40 WPM (Vertanen & MacKay, 2010) which is similar 
to keyboard text entry. Speech Dasher is an example of a multimodal interface where gaze is used to enhance 
the capabilities of speech recognition. In the current study, eye gaze and speech will be used simultaneously in 
such a manner that the disadvantages of one are counteracted by the other. 
The current study will build on the idea that eye gaze will be used to establish which keyboard button is 
required by the user. However, instead of relying on the inaccurate or time-consuming methods of eye gaze 
only, an additional modality is suggested. The use of look-and-shoot with a physical trigger assumes that the 
user may have some mobility although it may be possible to use a triggering mechanism such as blowing in a 
pipe. Instead, this study will remove the reliance on physical dexterity and will build on the idea proposed by 
Tan et al. (2003a) that speech could be used to activate the focused key. However, it also assumes that some 
users may have limited vocabularies and may not be able to vocalise all alphabetic letters. Therefore, a single 
command, which can be customised to meet the abilities of the user, will be used to activate the key which 
currently has focus. Through this means it will be possible to provide text entry capabilities using eye gaze and 
speech. This method of text entry may eventually prove to be more accurate than dictation given the inherent 
recognition error rate with dictation systems. Furthermore, when accounting for time spent on error 
correction, it may also be faster. The scope of this study will only encapsulate the comparison of this entry 
method with the traditional keyboard but comparison with dictation is proposed for future research. 
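The proposed entry method, in which gaze sets the focus and a single customisable spoken command activates the focused key, can be sketched as a small state holder. The class and method names are hypothetical and the sketch is not the code of the developed application; speech recognition and eye-tracking are assumed to deliver their results through the two callbacks.

```python
class GazeSpeechKeyboard:
    """Gaze moves the focus; one user-chosen word activates the focused key."""

    def __init__(self, activation_word="go"):
        # The activation word can be any utterance the user is able to produce.
        self.activation_word = activation_word.strip().lower()
        self.focused_key = None
        self.text = []

    def on_gaze(self, key):
        """Called when the gaze lands on an onscreen key (or None when it leaves)."""
        self.focused_key = key

    def on_speech(self, utterance):
        """Type the focused key if the activation word was spoken; report success."""
        if utterance.strip().lower() != self.activation_word:
            return False
        if self.focused_key is None:
            return False
        self.text.append(self.focused_key)
        return True
```

A short usage example, with "yes" as the customised command:

```python
kb = GazeSpeechKeyboard(activation_word="yes")
kb.on_gaze("h")
kb.on_speech("yes")
kb.on_gaze("i")
kb.on_speech("yes")
typed = "".join(kb.text)
```

Because only one word has to be recognised, the user does not need to be able to vocalise every letter of the alphabet.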
 
2.10 Summary 
This chapter discussed some of the relevant literature on which the current study was based. Based on the 
literature review it was hypothesised that multimodal interfaces may offer a more intuitive and natural means 
of human-computer communication. Modalities on their own oftentimes have associated disadvantages which 
can prohibit widespread acceptance and use of the modalities. However, instead of aggravating the problems 
experienced with the individual modalities, a multimodal interface can potentially compensate for 
disadvantages of one by drawing on the advantages of the other. Many examples of this were discussed in this 
chapter.  
Since eye gaze and speech were the chosen modalities of the multimodal interface in the current study, each 
of these was discussed in detail. Interaction methods as well as methods to increase the accuracy of eye gaze 
were discussed. The identified activation mechanisms for eye gaze, namely dwell time, blinking and look-and-
shoot, will all be included in the multimodal interface which will be developed as well as some of the 
mechanisms suggested to increase the accuracy of eye gaze. However, in order to negate the disadvantages of 
eye gaze, it will be coupled with speech. Negativity about continued use of speech recognition may stem from 
the high memorisation rate and system response and capability as well as the environments which are best 
suited to speech recognition. Memorisation of single commands using word processor terminology is not 
considered a serious concern as word processors have a language which is unique to their environment 
and which has not curbed the popularity of the software, as is evidenced by its widespread use. Novice users 
may experience some difficulty, but it should not be more than the normal learning curve experienced when 
using the software. An additional mnemonic strain could occur if users are expected to remember an entire 
sequence of commands without visual feedback. However, if the speech commands are closely coupled with 
the naming of menus and tabs and executed in the same sequence as mouse clicks (for example, the menu 
name which causes the expansion of the menu, the menu command which opens the dialogue box and then 
utterance of a command), with the same level of visual feedback, it should not be a problem. The use of 
speech recognition will also alleviate the need to provide alternatives for all types of mouse clicking with eye 
gaze only, as commands can be provided to circumvent this. Using speech as an activation method for eye 
gaze could improve the speed and accuracy of other activation methods. 
Many multimodal interfaces have already been empirically investigated but the combination of eye gaze and 
speech is a relatively new area, particularly when used for text entry. Some results have been forthcoming in 
this area but it remains to be seen whether eye gaze and speech will be able to achieve the speeds of more 
traditional means of input. Insofar as can be ascertained, the multimodal interactions have never been fully 
integrated into a mainstream application or a fully functional word processor. The development of such an 
application will be discussed in the following chapter as well as the methodology which will be followed to test 
the usability of the multimodal word processor. 
  
 
 
CHAPTER 3 
EXPERIMENTAL DESIGN AND METHODOLOGY 
 
3.1 Introduction 
The previous chapter discussed some of the available literature which was used to motivate the study and 
upon which it was based. Different types of gaze interaction methods were identified and the advantages and 
disadvantages of both eye gaze and speech were discussed. This chapter will discuss the experimental design 
which will be used to answer the research questions which have been posed. In particular, details of the actual 
tests used and the procedures followed will be elaborated upon. 
 
3.2 Experimental design 
The main aim of the study was to determine the feasibility and usability of eye gaze and speech when used as 
an interaction technique in a word processor (Section 1.2). Therefore, the study can be divided into two main 
parts, namely the feasibility and the usability of such a multimodal interface. In order to evaluate the feasibility 
of such an application, two phases were identified which had to be completed, namely: 
1. The proposed application had to be developed in order to tentatively verify the feasibility of 
incorporating a multimodal interface into a word processor. 
2. The feasibility had to be verified through a more concrete means than simple development. 
In order to meet the requirements of the first phase, an application was developed which incorporated all the 
features of the proposed multimodal interface. This will be discussed in section 3.3. The second phase was 
achieved by conducting a feasibility test using Human-Computer Interaction (HCI) researchers as a sample. The 
experimental design of this phase will be discussed in section 3.4.1. 
Once feasibility had been established, the usability of the application had to be tested. For this 
purpose, the functions that needed to be performed were identified. The experimental design for this phase 
will be discussed in sections 3.4.2 and 3.4.3. 
 
3.3 Development of the application 
3.3.1 Motivation  
As previously discussed, a word processor application is one of the most popular and widely used applications. 
Furthermore, Microsoft Word® is the leader in the word processor market with high market penetration 
(Bergin, 2006a; Bergin, 2006b). The interface used by Microsoft in previous versions has become the de facto 
standard for interfaces of similar packages as well as other types of applications and it may be said that it 
paves the way for the establishment of trends. Therefore, it stands to reason that Microsoft Word would play 
a central role in this study. Additionally, disabled users are often relegated to using software which has been 
specially developed for them. This software often does not encompass the full functionality of a product such 
as Microsoft Word. Moreover, the support and availability of such software is frequently not of the same 
standard as offered by more mainstream suppliers. Consequently, Microsoft Word provides an ideal 
environment for the development of a truly multimodal interface. 
 
3.3.2 Hardware 
The eye-tracker used during the study was a Tobii T120 eye-tracker (www.tobii.com). This eye-tracker was 
chosen due to its availability at the university at which the study was conducted. The data rate of the T120 
eye-tracker is 120 Hz and the accuracy is measured at 0.5 degrees. The results obtained during the study 
should be interpreted with reference to the eye-tracker used, as there are trackers available with both higher 
precision and accuracy. Since the only eye-tracker that was available for the study was the Tobii T120, the tests 
could not be conducted on a range of eye-trackers in order to determine the effect that the choice of tracker may 
have on the results. 
A Logitech webcam with a built-in microphone was used to capture verbal utterances for speech recognition. 
The computer used had a quad core i7 processor with 4 GB of RAM. The screen resolution was set to 
1280×1024 at all times and the participants were requested to sit approximately 60 cm from the screen. 
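To give an impression of what an accuracy of 0.5 degrees means on screen, the visual angle can be converted to pixels using the viewing distance and resolution given above. The sketch below assumes a 17-inch (431.8 mm) display diagonal, which is not stated in the text, and uses the standard relation size = 2 × distance × tan(angle / 2). The function names are hypothetical.

```python
import math


def pixels_per_mm(h_res, v_res, diagonal_mm):
    """Pixel density derived from the resolution and the physical screen diagonal."""
    return math.hypot(h_res, v_res) / diagonal_mm


def degrees_to_pixels(angle_deg, distance_mm, px_per_mm):
    """Convert a visual angle to an on-screen extent in pixels."""
    size_mm = 2 * distance_mm * math.tan(math.radians(angle_deg) / 2)
    return size_mm * px_per_mm


# Assumed 17-inch diagonal (431.8 mm); resolution and distance as reported above.
density = pixels_per_mm(1280, 1024, 431.8)
accuracy_px = degrees_to_pixels(0.5, 600, density)
```

Under these assumptions, an accuracy of 0.5 degrees at 60 cm corresponds to roughly 20 pixels, which gives an indication of the minimum target size and spacing for which gaze alone can be expected to discriminate reliably.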
 
3.3.3 Development tools 
Visual Studio Tools for Office (VSTO) allows .NET developers to customise not only the interface of the Office 
suite but also to add the functionality that is required (Anderson, 2009). Therefore, VSTO was used to 
manipulate Microsoft Word to make a multimodal interface within a well-known environment. The integrated 
development environment (IDE) of Visual Studio 2008 was used. The programming language was C# using the 
.NET framework 3.5. 
In order to include the interaction techniques of speech and eye-tracking, some third party tools were 
required.  The Microsoft speech application programming interface (SAPI) is the native speech API for 
Windows (Microsoft, n.d.) and provides access to text-to-speech (TTS) engines as well as automatic speech 
recognition (ASR) engines (Simon, 2002). The SAPI software development kit (SDK) provides samples and tools 
to incorporate speech capability in developed applications (Microsoft, n.d.). The SAPI allows the use of 
dictation in an application or specialised grammars can be created for use within the application. The 
Microsoft SAPI is free to download and since its capabilities were deemed sufficient for the purposes of this 
study, it was used to provide speech capabilities for the multimodal interface. 
In order to provide eye-tracking capabilities, the Tobii® SDK was used. The availability of the Tobii eye-tracker 
at the university at which the study was conducted was the overwhelming factor in selecting its associated SDK 
for use.  
Magnification capability was also provided in the application. For these purposes, the relatively inexpensive 
Magnifying Glass Pro® (www.workerscollection.com/wcollect/english/html/mg_pro.html) was used. This tool 
allows the magnification of the area directly under the mouse cursor. Furthermore, it is one of the few tools 
found that allow for capturing of mouse clicks on the magnified area. These mouse clicks are then 
automatically transferred to the underlying area and the application responds appropriately without having to 
close the current magnified area. As such, this tool provides functionality which many freeware tools were 
lacking. Should it be found that the tool increases productivity and assists the user in invaluable ways, future 
research can include the development of a free-to-use magnification tool which achieves the same 
functionality as Magnifying Glass Pro, or alternative ways can be investigated to use currently available 
freeware products and to allow the magnified area to be interactive.  
 
3.3.4 Interaction techniques  
The interaction techniques of eye gaze and speech were proffered as solutions to create a highly customisable, 
hands free multimodal interface which could potentially cater for a diverse group of users with varying 
capabilities and levels of expertise. The aim was to make an all-encompassing application through which all of 
the necessary interaction could take place so as to minimise disruption for the user who is then not required to 
switch between multiple applications. To this end, tools such as the calibration and the setting of the gaze 
interaction sensitivity were all included in the application. For example, Figure 3.1 shows the results 
of a calibration contained within a separate window in Microsoft Word. The only external tool which was 
required was the training wizard for the speech engine. For this the user must use the wizard through normal 
Windows interaction and not through the developed application. 
 
 
Figure 3.1: Calibration process in Microsoft Word 
The discussion in Chapter 2 highlighted various ways in which eye gaze could be used as an input technique, 
namely dwell time, look-and-shoot, gaze gestures and blinking. Apart from gaze gestures which were not 
implemented due to time constraints, all the other means of communication were incorporated into the 
application. Look-and-shoot uses the Enter key as an activation mechanism. Furthermore, the sensitivity of the 
dwell time can be set to allow further customisation. Additionally, eye gaze and speech recognition can be 
used in combination as an interaction technique similar to look-and-shoot but where the activation 
mechanism is a speech command and not a physical device.  
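The dwell time mechanism can be illustrated with a minimal Python sketch. It is an illustration only and not the code of the developed application: the class name DwellDetector, the default of 500 milliseconds and the use of string identifiers for targets are all assumptions. A click is reported once when the gaze has rested on the same target for the configured dwell time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DwellDetector:
    """Reports a target once the gaze has rested on it for dwell_ms milliseconds."""
    dwell_ms: float = 500.0          # hypothetical default sensitivity
    _target: Optional[str] = None    # target currently under the gaze
    _since_ms: float = 0.0           # time at which the gaze entered that target
    _fired: bool = False             # prevents repeated clicks during one dwell

    def update(self, target: Optional[str], now_ms: float) -> Optional[str]:
        """Feed the target under the gaze; return it when a click should fire."""
        if target != self._target:
            # The gaze moved to another target (or off all targets): restart the timer.
            self._target, self._since_ms, self._fired = target, now_ms, False
            return None
        dwelt_long_enough = now_ms - self._since_ms >= self.dwell_ms
        if target is not None and not self._fired and dwelt_long_enough:
            self._fired = True
            return target
        return None
```

In this sketch a shorter dwell_ms makes the interface more responsive but increases the risk of unintended activations, which is why the sensitivity is left to the user.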
In order to be able to use these interaction techniques for text input, onscreen keyboards are available and are 
displayed as a panel in Microsoft Word. The figure below shows the onscreen QWERTY keyboard. The 
keyboard not only has alphabetic characters but also provides commonly used keys such as Page Up, Page 
Down, Home, End and Delete keys. The user is also able to activate and deactivate Caps Lock. A Select All 
button is provided to allow an easy method of selecting a large amount of text with a single click. When a 
button on the onscreen keyboard is pressed using any activation mechanism, audible feedback is given in the 
form of a clicking sound. Therefore, it will not be necessary for users to look at the document in order to 
obtain confirmation that the button has been pressed. 
 
Figure 3.2: Onscreen QWERTY keyboard 
One implication of using eye gaze to simulate a pointing device is that buttons and selectable targets must be 
larger than in standard interfaces. Additionally, spacing between targets might have to be adjusted in order to 
allow a margin of error around each selectable target, so that the intended target can be identified accurately 
from the proximity of the eye gaze to the targets in its immediate vicinity. 
Therefore, the buttons on the keyboard were larger than standard Windows buttons and were also more 
widely spaced. The arrow buttons on the bottom left of the keyboard allow the user to respectively decrease 
and increase the size of the keyboard keys. Resizing the keyboard keys also causes the size of the lettering on 
the keys to be resized proportionally. Magnification was also proposed as a possible solution to decrease the 
amount of screen real estate which is lost whilst still enjoying the advantages of larger selectable targets. The 
use of the magnification tool is shown in the figure below. The yellow arrow on the figure indicates the current 
position of the mouse cursor. The default zoom factor enlarged the area to double its actual size within a 
400×300 window.  
 
 
Figure 3.3: Magnification of the onscreen keyboard 
It was established in a prior section that feedback is vital when using eye gaze as a pointing device so that the 
user is aware of the position at which the eye gaze is being detected. However, a gaze indicator which is slaved 
to the eye gaze may disturb the user as the gaze indicator will never be still if it accurately reflects eye 
movements. Taking this into account, it was decided that the gaze indicator should remain stable within the 
confines of the closest selectable target. Consequently, the gaze indicator does not mimic eye movement but 
is rather stabilised on a selectable target for as long as the eye gaze is situated closest to that target. In order 
to further minimise the negative impact of a gaze indicator, gaze position is indicated only when the gaze is on 
the onscreen keyboard and not when it is situated on other parts of the document.  
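A minimal Python sketch of this stabilisation is given below. It assumes that each key is represented by the coordinates of its centre and that a gaze point further than max_distance from every key is regarded as being off the keyboard; the function names and this cut-off are hypothetical and do not reflect the actual implementation.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def nearest_target(gaze: Point, centres: Dict[str, Point],
                   max_distance: float) -> Optional[str]:
    """Return the key whose centre is closest to the gaze point, or None when
    every key is further away than max_distance (gaze is off the keyboard)."""
    best_key, best_dist = None, max_distance
    for key, (cx, cy) in centres.items():
        dist = math.hypot(gaze[0] - cx, gaze[1] - cy)
        if dist <= best_dist:
            best_key, best_dist = key, dist
    return best_key


def indicator_position(gaze: Point, centres: Dict[str, Point],
                       max_distance: float) -> Optional[Point]:
    """Position at which to draw the indicator: the centre of the nearest key,
    or None when no indicator should be shown."""
    key = nearest_target(gaze, centres, max_distance)
    return None if key is None else centres[key]
```

Because the indicator is drawn at the centre of the nearest key rather than at the raw gaze coordinates, small eye movements within the same key do not move it.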
Originally, gaze position was indicated by centring a 10 x 10 pixel solid square on the button directly under the 
gaze of the user (Figure 3.4a). During the feasibility testing, before the formal user testing commenced, it was 
recommended that the square not obscure the letter on the key, as this requires that the user look away and 
then back to confirm that the correct key has been located. Consequently, the square was positioned slightly 
off-centre leaving the letter on the button completely visible (Figure 3.4b). 
 
   
Figure 3.4: (a) Centred and (b) off-centre gaze position indicator 
 
Although the off-centre indicator left the letter on the button visible, it was feared that novice users 
would follow the gaze indicator and not focus on the button. However, since the square stays in a stable 
position on the button for as long as the eye gaze is on the button or in close proximity, following the 
indicator should not affect the selection. Nevertheless, to allay concerns that the off-centre 
indicator may distract the users or prevent them from accurately seeing the letter, even though this should 
comfortably be perceived in their peripheral vision, other options were explored for gaze indicators. These 
included a hollow circle (Figure 3.5a) or square (Figure 3.5b) which surrounds the letter on the button, where 
the width and colour of the shape could be set by the user. 
   
Figure 3.5: (a) Hollow circle and (b) square used as gaze indicators 
 
However, the two options included in the final application were the most aesthetically pleasing and had the 
simultaneous benefit of providing unequivocal confirmation of which button was receiving focus. 
These were a frame which was drawn around the border of the button and the inverting of the button colour. 
The frame is green in colour which provides ample contrast to alert the user as to which button currently has 
focus. It also has the added advantage that the letter on the button is completely visible and contained within 
the frame. The second option provides visual feedback by inverting the colour of the button which has focus. 
This means that when a button has focus, its background colour is a darker grey and the colour of the letter is 
white. The figure below shows a framed button on the left and on the right, the use of the inverted colour. 
   
Figure 3.6: Visual feedback on a selectable target through (a) framing and (b) inverting colours 
 
Speech recognition was also incorporated as a standalone interaction technique to facilitate a means of 
navigation, editing, selection and manipulating text. The speech engine provides a means for both dictation 
and commands to be issued. A specialised grammar was developed for use within a word processor. This 
grammar allows for formatting controls, document handling, basic and complex cursor control and mouse 
manipulation. The complete set of available commands is tabulated below. 
  
Table 3.1: Verbal commands

| Group | Command | Application reaction | Current key press |
|---|---|---|---|
| Formatting commands | Bold | Activate/deactivate bold | [CTRL] + B |
| | Italic / Emphasise | Activate/deactivate italic | [CTRL] + I |
| | Underline | Activate/deactivate underline | [CTRL] + U |
| Document handling commands | Cut | Cut the current selection | [CTRL] + X |
| | Copy | Copy the current selection | [CTRL] + C |
| | Paste | Paste the current clipboard item at the cursor position | [CTRL] + V |
| | Undo | Undo the previous action | [CTRL] + Z |
| | Delete / Remove | Delete text to the right of the cursor or a current selection if present | [DELETE] |
| Basic cursor control | Down | Move the cursor one position down | [DOWN] arrow |
| | Left | Move the cursor one position to the left | [LEFT] arrow |
| | Right | Move the cursor one position right | [RIGHT] arrow |
| | Up | Move the cursor one position up | [UP] arrow |
| | Home | Move the cursor to the start of the current line | [HOME] |
| | End | Move the cursor to the end of the current line | [END] |
| Complex cursor control and selection techniques | Select line | Select the entire line that the cursor is currently on | [HOME] and then [SHIFT] + [END], OR requires left mouse click in the left margin |
| | Select word | Select the word subsequent to the cursor or the current selection | [SHIFT] + [CTRL] + [RIGHT] arrow |
| | Select word back | Select the word prior to the cursor or the current selection | [SHIFT] + [CTRL] + [LEFT] arrow |
| | Shift down | Move the cursor down as though the Shift key is in | [SHIFT] + [DOWN] arrow |
| | Shift left | Move the cursor left as though the Shift key is in | [SHIFT] + [LEFT] arrow |
| | Shift right | Move the cursor right as though the Shift key is in | [SHIFT] + [RIGHT] arrow |
| | Shift up | Move the cursor one position up as though the Shift key is in | [SHIFT] + [UP] arrow |
| | Select All | Selects all the text in the document | [CTRL] + [A] |
| Mouse manipulation | Click / Activate / Select / Go | Left mouse click | |
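The way in which such a command grammar can be mapped to application reactions is illustrated by the following Python sketch. It is not the grammar as built with the Microsoft speech API: the names GRAMMAR, MOUSE_COMMANDS and interpret are hypothetical, key presses are represented simply as tuples of key names, and it is assumed that Emphasise is a synonym of Italic and Remove a synonym of Delete. Only the keyboard route for Select line is modelled.

```python
from typing import List, Optional, Tuple

Chord = Tuple[str, ...]   # keys pressed together, e.g. ("CTRL", "B")

# Spoken phrase -> sequence of key chords, following Table 3.1.
GRAMMAR = {
    "bold": [("CTRL", "B")],
    "italic": [("CTRL", "I")],
    "emphasise": [("CTRL", "I")],        # assumed synonym of "italic"
    "underline": [("CTRL", "U")],
    "cut": [("CTRL", "X")],
    "copy": [("CTRL", "C")],
    "paste": [("CTRL", "V")],
    "undo": [("CTRL", "Z")],
    "delete": [("DELETE",)],
    "remove": [("DELETE",)],             # assumed synonym of "delete"
    "down": [("DOWN",)],
    "left": [("LEFT",)],
    "right": [("RIGHT",)],
    "up": [("UP",)],
    "home": [("HOME",)],
    "end": [("END",)],
    "select line": [("HOME",), ("SHIFT", "END")],
    "select word": [("SHIFT", "CTRL", "RIGHT")],
    "select word back": [("SHIFT", "CTRL", "LEFT")],
    "shift down": [("SHIFT", "DOWN")],
    "shift left": [("SHIFT", "LEFT")],
    "shift right": [("SHIFT", "RIGHT")],
    "shift up": [("SHIFT", "UP")],
    "select all": [("CTRL", "A")],
}

# Phrases that trigger a left mouse click.
MOUSE_COMMANDS = {"click", "activate", "select", "go"}


def interpret(utterance: str) -> Optional[Tuple[str, object]]:
    """Map a recognised utterance to ("keys", chords) or ("mouse", "left_click").
    Returns None when the utterance is not part of the grammar."""
    phrase = " ".join(utterance.lower().split())   # normalise case and spacing
    if phrase in MOUSE_COMMANDS:
        return ("mouse", "left_click")
    if phrase in GRAMMAR:
        return ("keys", GRAMMAR[phrase])
    return None
```

For example, interpret("Select line") yields the two chords [HOME] and [SHIFT] + [END] in sequence, whereas an utterance outside the grammar is ignored.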
 
An extra tab was added to the ribbon in Microsoft Word (Figure 3.7) to accommodate all the additional 
features which were added. This tab is called Multimodal Add-Ins and allows the user of the application to set 
the interaction techniques as desired. Table 3.2 summarises the settings and explains the options provided on 
the multimodal tab. 
 
Figure 3.7: Multimodal Add-Ins tab in Microsoft Word 
 
As can clearly be seen from the summary in Table 3.2, the multimodal interface is highly customisable to suit 
the expertise and the current needs and environment of the users. The interaction techniques were added in 
order to complement the existing input methods and were not intended to replace them. Therefore, the 
multimodal interface meets the requirement of having to provide alternative means of input to prevent 
overuse of a single one (Oviatt & Cohen, 2000). The next section will discuss some technical specifications of 
the developed application. 
 
3.3.5 Technical specifications  
As mentioned in a previous section, VSTO was used to modify the interface of the Word environment. Third-
party tools and SDKs were then used to add the required functionality. These tools included the Tobii SDK, 
Microsoft speech API and Magnifying Glass Pro. Therefore, apart from using VSTO, these additional tools had 
to be managed and the functionalities they provided had to be programmed into the VSTO solution. Figure 3.8 
illustrates the classes used in the application to get the complete set of interaction techniques. The class 
diagram does not contain insignificant class attributes, for example, those used to monitor the status of the 
interaction techniques. Similarly, no properties are indicated on the class diagrams. The class diagrams are 
used simply to show the essential functionality of the classes. 
The cCommandList class controls the panel which is displayed to show the command list for the speech 
recognition grammar. 
The cKeyboard class maintains the layout of the standard and alphabetical keyboards. It is also responsible 
for the resizing and consequent spacing of the keys on the keyboard. The KeyPrint method is used by all key 
presses to type the letter associated with the pressed button in the document at the current cursor position. 
The cRibbon class was used for the design and basic onClick events for the Multimodal Add-Ins tab. 
The DocumentContentManager class executes all commands that are issued within the current Word 
document. For example, the SelectLine method will select the line of the document on which the cursor is 
currently situated. The DocumentBoldCommand will send a command to the document to toggle bold 
formatting on or off depending on its current status. 
The cSpeechController class uses the Microsoft speech API to handle both the dictation and command 
mode. The class manages the toggling between dictation and command mode, starts and stops the speech 
engine and loads and manages the use of the Windows speech profiles. The class is responsible for building 
the grammar required, capturing verbal utterances and responding to them via a DocumentContentManager 
object. The commands captured via a cSpeechController object will invoke the correct method in the 
DocumentContentManager object which will in turn send the correct command through to the Word 
document. 
 
Table 3.2: Multimodal Add-Ins tab functions

Group: Magnification
This Magnification button allows the user to toggle the magnification on and off. A standard Microsoft Office 
toggle button is used to ensure that the user is always aware of the status.

Group: Keyboard
This group allows the user to control the display of the onscreen keyboards.
When the user presses the Standard Keyboard button, a standard QWERTY layout keyboard is displayed along the 
bottom of the Word document (Figure 3.2).
When the user presses the Alphabetic Keyboard button, an onscreen keyboard is displayed along the bottom of 
the Word document configured in alphabetical order.

Group: Speech
This group allows the user to control the use of speech within Microsoft Word.
The Start Engine and Stop Engine are two mutually exclusive buttons used to control whether the speech engine 
is active or not. When the user presses the Start Engine button, the speech engine is activated and the 
interface will react to any verbal utterance that is captured. The user must then press the Stop Engine button 
in order to deactivate the speech engine.
The Profiles drop-down box loads all the trained profiles from the list that Windows maintains. The user can 
then select the profile that they would like to use for the speech engine component of the multimodal interface.
When the Command Mode check box is selected, the speech engine only responds to words contained within the 
word processor grammar (Table 3.1). Otherwise, when the speech engine is on, the speech engine is in dictation 
mode and any verbal utterances captured are written to the current document through the Speech-To-Text engine.
The Say what? button shows a list of acceptable verbal commands which can be issued in order to perform 
common word processing tasks.
The Typing button allows the grammar to be minimised to commands for selection of onscreen targets only. 
Therefore, all the formatting, text selection and navigation commands are disabled when the feature is activated.

Group: Gaze
The Calibrate button allows the Tobii calibration process to start. The calibration is required for a new user 
to ensure that the tracking of the eye gaze is accurate. Calibration occurs exclusively through the Word 
interface (Figure 3.1).
The Use eye gaze checkbox provides a quick mechanism for the user to toggle the reaction to eye gaze on and off.
The Sensitivity Setting allows the user to determine the length (in milliseconds) of the dwell time and the 
sensitivity of system response to user blinking.
The Gaze Type drop-down box allows the user to choose how the system must react to the eye gaze.
• No activation mechanism – This allows the user to use eye gaze and speech together. The eye gaze of the 
user is tracked and when a verbal command is issued, the command is executed at the current position of the 
eye gaze.
• Dwell time – the system responds to dwell time as set by the sensitivity setting. If the user gazes at a 
particular area for the length of the sensitivity setting, then a left mouse click is executed at that location. 
The length of the dwell time can be set to increase the customisation of the application.
• Blink – the system responds to blinks of the user. The blink must be more pronounced than an involuntary 
blink so the natural blinking process should not interfere with the interaction technique.
• Gesture – the system will respond to an eye gaze gesture by executing a left mouse click at the current 
location of the eye gaze. Gestures were not implemented due to time constraints.
• Enter Key – this implements the look-and-shoot method of interaction. When the user presses the Enter key, 
a left mouse click is executed at the current location of the eye gaze. The gaze location is only interpreted 
should it be located on the onscreen keyboard. Therefore, look-and-shoot does not interfere with normal typing 
on the document area should the user also wish to use the keyboard for some typing.
The Gaze Shape drop-down box allows the user to specify the form of the visual feedback on the onscreen 
keyboard. By default, the button is framed but if users so prefer, they can also choose to invert the colour of 
the button which is currently being gazed at.
 
 
 
Figure 3.8: Class diagram of developed application 
The cCalibration class is responsible for the management of all eye-tracker generated data. At the most 
basic level, it handles the calibration process for the Tobii eye-tracker and stores the results of the calibration 
process. It is also responsible for turning the eye-tracking function on and off. Thereafter, if it is required it will 
monitor the eye gaze of the user and respond to blinks and dwell time appropriately. When look-and-shoot or 
speech is used, this class is used to determine the position of the eye gaze so as to interpret where the left 
mouse click must be executed. Furthermore, the class also controls the visual feedback of the gaze indicator by 
determining whether the eye gaze of the user is currently positioned over the onscreen keyboard. In order to 
do this the class employs the use of a gaze stabilising algorithm. Since the eye is subject to noisy fixational eye 
movements, slaving the gaze indicator to the eye gaze in a dedicated fashion will result in a fairly jumpy gaze 
indicator which might distract the users more than it will assist them. Therefore, extracts of the smoothing 
algorithm of Kumar (2007) were used to stabilise the gaze on the button nearest the current eye position. This 
algorithm smooths the data in real time by determining whether the most recent point is the start of a 
saccade, whether it belongs to the current fixation or whether it is an outlier. For these purposes, if the 
distance between two points is more than a previously defined threshold then a saccade is detected. The 
algorithm is robust to noise since it measures “the displacement of each eye relative to the current estimate of 
the fixation location rather than to the previous measurement” and, by looking one measurement ahead, rejects 
isolated movements which are over the threshold as outliers (Kumar, 2007). 
Together with this smoothing algorithm, the fixation points were calculated as the weighted mean of all the 
points in the fixation window using the following formula (Kumar, Klinger, Puranik, Winograd & Paepcke, 2008) 
where p is a data point within a fixation window with n points: 
 
1 + 2 +⋯+  =

 
(1 + 2 +⋯	+ )
 
The result of the implementation of this algorithm is a much smoother movement of the gaze indicator and 
more accurate determination of the eye gaze position. 
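The combination of saccade detection, outlier rejection and weighted averaging can be illustrated with the following simplified Python sketch. It is neither the algorithm of Kumar (2007) in full nor the code of the developed application. The names GazeSmoother, saccade_threshold and max_window are hypothetical, the weights 1 to n assume that the most recent point carries the largest weight, and two consecutive points beyond the threshold are simply assumed to mark the start of a new fixation.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def weighted_mean(window: List[Point]) -> Point:
    """Weighted mean of the points with weights 1..n (latest point weighs most)."""
    total = sum(range(1, len(window) + 1))
    x = sum(w * p[0] for w, p in enumerate(window, start=1)) / total
    y = sum(w * p[1] for w, p in enumerate(window, start=1)) / total
    return (x, y)


class GazeSmoother:
    def __init__(self, saccade_threshold: float, max_window: int = 20):
        self.threshold = saccade_threshold   # distance which signals a saccade
        self.max_window = max_window         # maximum points kept per fixation
        self.window: List[Point] = []        # points of the current fixation
        self.pending: Optional[Point] = None # far point awaiting confirmation

    def _far(self, p: Point) -> bool:
        """Is p further than the threshold from the current fixation estimate?"""
        fx, fy = weighted_mean(self.window)
        return math.hypot(p[0] - fx, p[1] - fy) > self.threshold

    def add(self, p: Point) -> Point:
        """Add a raw gaze point and return the smoothed fixation estimate."""
        if not self.window:
            self.window = [p]
        elif not self._far(p):
            # Point belongs to the current fixation; any pending far point
            # was an isolated outlier and is discarded.
            self.pending = None
            self.window.append(p)
            self.window = self.window[-self.max_window:]
        elif self.pending is None:
            # First far point: wait one measurement before deciding.
            self.pending = p
        else:
            # Second consecutive far point: treat as a saccade, new fixation.
            self.window = [self.pending, p]
            self.pending = None
        return weighted_mean(self.window)
```

In this sketch a single stray sample leaves the estimate unchanged, whereas a genuine saccade moves the estimate after a delay of one measurement.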
 
3.3.6 Resulting multimodal interface 
The previous sections detailed the development of the application, both in technical terms as well as giving a 
visual and all-encompassing discussion on the interaction techniques provided and how they could be used 
within the environment of a word processor. The resulting application was one which was highly customisable 
and provided a multitude of interaction means as replacement options for the traditional keyboard and 
mouse. All of these additional functions could be provided to the end-user through the well-known Microsoft 
Word. 
This development of the complete solution positively answered the first research question which was posed in 
section 1.5, namely whether it was possible to provide a highly customisable multimodal interface using eye 
gaze and speech within a mainstream word processor. The next sections will discuss how the remaining four 
research questions will be answered. 
 
3.4 Resolving the empirical research questions 
Research questions 2 and 3 (with the three secondary research questions) must still be answered. However, 
they necessitate that a robust experimental design be established before they can be resolved. Research 
questions 2 and 3 as defined in section 1.5 are as follows: 
2. How feasible is such an interface and in which context is it feasible? 
3. How usable is the multimodal interface compared to the traditional interaction techniques? 
a. How usable is the combination of eye gaze and speech when used to simulate a pointing 
device? 
b. How usable are speech commands for performing common word processing tasks? 
c. How usable is the combination of eye gaze and speech when used for text entry? 
The experimental design required to answer research question 2 will be discussed in the following section. 
Thereafter, section 3.4.2 will discuss the methodology which will be used to answer question 3a. The approach 
employed to answer questions 3b and 3c will be discussed in section 3.4.3.
 
3.4.1 Feasibility study 
In order to explore objective 2 as detailed in Chapter 1 and answer research question 2 above, a feasibility 
study will be conducted using the application developed. The scope of a feasibility study is not to identify 
usability problems, but rather to determine whether the envisaged system has long-term potential as a viable 
multimodal interface within the realm of modern-day word processing. 
Due to the nature of the study, five participants are sufficient for such an undertaking (Nielsen, 2000). 
Therefore, five senior members of the lecturing staff of the university where the study was conducted, who 
are proficient in the field of HCI, will be approached to participate in the study. A pre-test questionnaire 
(Appendix A) will be used to measure their level of expertise and exposure to the technologies used in the 
multimodal interface. 
The participants will be given a thorough demonstration of the functionalities of the application and then 
allowed some time to become familiar with the system. They will then be requested to complete some simple 
open-ended tasks (Appendix B). The tasks will be left open-ended as the results of the tasks will not be 
evaluated and the premise of this initial study is simply to observe user interaction and allow participants the 
freedom to use the system in order for them to form an objective opinion of the system. At the end of each 
participant’s session, the participant will be required to complete the post-test questionnaire as contained in 
Appendix C. Results of the feasibility study will be discussed in detail in Chapter 4. 
 
3.4.2 Pointing and clicking 
The next research question which must be answered is in regard to the usability of eye gaze and speech when 
used to simulate a pointing device. A suitable means of user testing must first be determined which will 
facilitate the collection of data. The data must fulfil the requirement that at least one measurement per 
usability component must be analysed. 
 
3.4.2.1 Assessment of a pointing device  
The most commonly used metrics to evaluate pointing devices are speed and accuracy (MacKenzie, Kauppinen 
& Silfverberg, 2001) which give a good indication as to whether there is a difference between the performance 
of pointing devices (Hwang, Keates, Langdon & Clarkson, 2004). In 1954, Paul Fitts proposed a relationship 
between target size and distance to the target which could effectively predict movement time from the 
current position to the targeted position (Fitts, 1954). This relationship was henceforth known as Fitts’ Law.  
Since major pointing devices are used to position a cursor over a target using hand movement (Shneiderman, 
1998), Fitts’ Law has often been applied to HCI. The application by HCI pundits was generally in one of two 
ways, namely to predict the time required to position a cursor over some target based on the distance to 
travel and the size of the target or as a means to derive the throughput (discussed below) by measuring 
movement times and then determining how the different conditions affect the coefficients in Fitts’ Law 
(Soukoreff & MacKenzie, 2004). In this way, it became possible to compare the efficiency of 
various pointing devices.  
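Both uses can be illustrated with a short Python sketch based on the Shannon formulation of Fitts’ Law commonly used in HCI, MT = a + b log2(D / W + 1), where D is the distance to the target, W the width of the target and a and b are empirically determined coefficients. The function names are hypothetical and the coefficients in the usage note are invented purely for illustration.

```python
import math
from typing import Sequence, Tuple


def index_of_difficulty(distance: float, width: float) -> float:
    """Index of difficulty in bits (Shannon formulation)."""
    return math.log2(distance / width + 1)


def predict_movement_time(distance: float, width: float,
                          a: float, b: float) -> float:
    """Predicted movement time MT = a + b * ID."""
    return a + b * index_of_difficulty(distance, width)


def fit_fitts(ids: Sequence[float], times: Sequence[float]) -> Tuple[float, float]:
    """Least-squares estimate of the coefficients (a, b) from measured
    movement times at known indices of difficulty."""
    n = len(ids)
    mean_id = sum(ids) / n
    mean_mt = sum(times) / n
    sxx = sum((x - mean_id) ** 2 for x in ids)
    sxy = sum((x - mean_id) * (y - mean_mt) for x, y in zip(ids, times))
    b = sxy / sxx
    a = mean_mt - b * mean_id
    return a, b
```

With illustrative coefficients a = 0.2 s and b = 0.1 s/bit, a target 100 pixels wide at a distance of 300 pixels has an index of difficulty of 2 bits and a predicted movement time of 0.4 s.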
However, since pointing devices are no longer only used to point but also to draw, write and navigate through 
nested menus, Fitts’ Law presents a somewhat limited methodology for evaluating pointing devices (Accot & 
Zhai, 1999). These additional uses for pointing devices are all trajectory-based, a feature which Fitts’ Law is ill-
equipped to evaluate. Therefore Fitts’ Law alone, when used to test pointing devices, neglects to test the 
quality of trajectories produced by these pointing devices (Accot & Zhai, 1999) and a more comprehensive set 
of tests is required to strengthen the comparison. 
The inclusion of Fitts’ Law in the International Organization for Standardization (ISO) standard ISO 9241-9 (ISO, 2000) 
confirmed its pre-eminence as the leader in the evaluation of pointing devices whilst also providing for 
trajectory-based testing to be performed on pointing devices through extension of Fitts’ Law. The ISO standard 
uses a throughput metric which encapsulates both speed and accuracy (ISO, 2000) in order to compare 
pointing devices and is measured using any one of six tasks including three point-and-click tasks which 
conform to Fitts’ Law (Carroll, 2003). The six tasks included in ISO 9241-9 are: 
1. Tapping tests (one-directional and multi-directional) 
2. Dragging tests 
3. Path-following tests 
4. Tracing test 
5. Free-hand input test 
6. Grasp and park test 
The one-directional tapping test requires the participant to move from a home area to a target and back. In 
contrast, the multi-directional tapping test consists of 24 boxes placed around the circumference of a circle. The 
participant is then required to move from the centre of the circle to a target box. From there the participant 
must move to and click in the box directly opposite that box and then proceed in a clockwise direction around 
the circle (Figure 3.9) until all the targets have been clicked in and the user is back at the first selected target 
box. The target which should be selected next should always be graphically highlighted for the user (Soukoreff 
& MacKenzie, 2004). 
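The layout and selection sequence of the multi-directional tapping test can be sketched in Python as follows. The function names are hypothetical, screen coordinates with the y-axis pointing downwards are assumed, and the selection order shown (each target followed by the one opposite it, advancing clockwise) is one interpretation of the description above rather than a normative statement of the standard.

```python
import math
from typing import List, Tuple


def circle_targets(n: int, radius: float,
                   centre: Tuple[float, float] = (0.0, 0.0)) -> List[Tuple[float, float]]:
    """Centres of n targets spaced evenly around a circle, starting at the top
    and proceeding clockwise (screen coordinates, y increases downwards)."""
    cx, cy = centre
    points = []
    for i in range(n):
        angle = 2 * math.pi * i / n
        points.append((cx + radius * math.sin(angle), cy - radius * math.cos(angle)))
    return points


def selection_order(n: int) -> List[int]:
    """Order in which targets are selected: 0, its opposite, 1, its opposite,
    and so on, ending back at the first target. Requires an even n."""
    if n % 2:
        raise ValueError("an even number of targets is assumed")
    half = n // 2
    order: List[int] = []
    for i in range(half):
        order += [i, i + half]
    return order + [0]
```

For 24 targets, selection_order(24) visits every target exactly once before returning to the first target, so that each movement crosses the circle.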
The dragging test is a variation of the one- and multi-directional tapping test where the user is required to drag 
an object and drop it in the destination target box. The path-following test requires the participant to trace or 
steer along a pre-defined path of a certain width. The fourth test is the tracing test which requires that the 
participant follow a circular path whilst attempting to stay within the bounding circles. The fifth task as set out 
by ISO 9241-9 is designed to test the effectiveness of the pointing device for entering free-hand text or 
pictures (Douglas, Kirkpatrick & MacKenzie, 1999).  
The final test is the grasp and park test during which “the subject performs a simple pointing task and operates 
a key on the keyboard between each pointing with the same hand” (ISO, 2000). This task is also referred to as 
a device switching task (Douglas et al., 1999), as it requires the subject to point at a target and then press a key 
on the keyboard using the same hand as was used to point at the target object. The tasks as set out in ISO 
9241-9 can be used to evaluate and compare pointing devices and can be selected according to their 
applicability to the pointing devices in question.  
 
Figure 3.9: Multi-directional tapping test using ISO 9241-9 
Throughput is measured using the tests as set out in ISO 9241-9 and is reported as bits per second (bps). The 
equation for calculating throughput is Fitts’ Index of Performance, with the exception that an effective index of 
difficulty is used (Zhang & MacKenzie, 2007). The equation for throughput is (Natapov, Castellucci & 
MacKenzie, 2009): 
 Throughput = IDe / MT (1) 
where MT is the mean movement time, in seconds, for all trials within the same condition and  
 IDe = log2(De / We + 1) (2) 
where De  is the effective distance to the target and We is the effective width of the target and is calculated as 
 We = 4.133 * SDx (3) 
The effective distance to the target (De) is the distance the subject traversed along the task axis (Natapov et 
al., 2009). In turn, the task axis is measured as the straight line from the centre of the source to the centre of 
the target (Natapov et al., 2009). The term SDx is the standard deviation of the selection coordinates (Douglas 
et al., 1999).  
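Equations (1) to (3) can be combined in a short Python sketch. The function names are hypothetical, and it is assumed that the selection coordinates are measured along the task axis and that SDx is the sample standard deviation.

```python
import math
import statistics
from typing import Sequence


def effective_width(selection_coords: Sequence[float]) -> float:
    """We = 4.133 * SDx, with SDx the sample standard deviation of the
    selection coordinates along the task axis (equation 3)."""
    return 4.133 * statistics.stdev(selection_coords)


def throughput(effective_distance: float,
               selection_coords: Sequence[float],
               movement_times: Sequence[float]) -> float:
    """Throughput in bits per second for one condition (equations 1 and 2)."""
    we = effective_width(selection_coords)
    ide = math.log2(effective_distance / we + 1)   # effective index of difficulty
    mt = statistics.mean(movement_times)           # mean movement time in seconds
    return ide / mt
```

For example, selection coordinates of -1, 0 and 1 pixels give SDx = 1 and We = 4.133 pixels. With an effective distance of 12.399 pixels, IDe = log2(4) = 2 bits, so that a mean movement time of 1 second yields a throughput of 2 bps.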
Apart from the standardised tests to measure throughput, ISO 9241-9 also provides a questionnaire designed 
to assess aspects of the operation, fatigue, comfort and overall usability (ISO, 2000) of the pointing devices.  
Since its inception in draft form in 1998, ISO 9241-9 has been used to compare a multitude of pointing devices 
ranging from joysticks and touchpads (Douglas et al., 1999), to mouse emulators (Man & Wong, 2007), hand 
gestures and even video game controllers (Natapov et al., 2009). It was originally designed to apply to 
mice, trackballs, light-pens and styli, joysticks, touch-sensitive screens, tablet-overlays, thumbwheels, 
hand-held scanners, pucks, hand-held bar code readers and remote-control mice (Douglas et al., 1999). It was not 
designed to cover input devices such as speech activators, head-mounted controllers, data gloves, devices for 
disabled users or foot-controlled devices (Douglas et al., 1999). However, its compatibility with eye-trackers 
and testing with disabled users has since been established through a number of studies (Keates et al., 2002; 
Zhang & MacKenzie, 2007; Man & Wong, 2007; Gajos et al., 2008). 
The ISO tasks that have been selected to be included in this study are as follows: 
1. Multi-directional tapping task 
The task was selected based on its applicability to the study and based on the functions the interaction 
techniques will eventually fulfil in the word processor application. For example, there is no use including the 
grasp and park test since it requires that a keyboard key be pressed between selecting targets using 
the same hand as for manipulation of the pointing device. Even though the user will be able to use both eye-
tracking and speech recognition in combination with the keyboard, the eye-tracking and speech recognition 
provide a completely hands-free environment and users will never have to switch devices using their hands. 
While ISO 9241-9, similar to Fitts’ Law, is undoubtedly a step in the right direction, allowing researchers to 
establish whether there are differences in speed and accuracy between various pointing devices, it 
fails to determine why these differences exist (Keates & Trewin, 2005). MacKenzie et al. (2001) 
propose seven additional measures which complement the measures of speed, accuracy and throughput and 
provide more insight into why differences are detected between the performance of pointing devices. 
The seven measures as proposed by MacKenzie et al. (2001) are as follows: 
1. Target re-entry  
a. If the pointer enters the area of the target, leaves it and then re-enters it, a target re-entry 
has occurred. 
2. Task axis crossing  
a. A task axis crossing is recorded if the pointer crosses the task axis on the way to the target. 
The task axis is normally measured as a straight line from the centre of the home square to 
the centre of the target (Zhang & MacKenzie, 2007). 
3. Movement direction change  
a. Each change of direction relative to the task axis is counted as a movement direction change. 
4. Orthogonal direction change  
a. Each change of direction along the axis orthogonal to the task axis is counted as an 
orthogonal direction change. 
5. Movement variability  
a. This “represents the extent to which the sample points lie in a straight line along an axis 
parallel to the task axis”.  
6. Movement error  
a. This is measured as the average deviation of the sample points from the task axis, regardless 
of whether these sample points are above or below the task axis.  
7. Movement offset 
a. This is calculated as the mean signed deviation of sample points from the task axis, so that 
deviations on opposite sides of the axis offset one another. 
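As an illustration, the measures that depend on the deviation of sample points from the task axis could be computed from logged pointer samples roughly as follows. This is a minimal Python sketch under stated assumptions and not the implementation used in this study: the function name `path_metrics`, the sample format and the sign convention are hypothetical, and movement variability is taken here as the sample standard deviation of the deviations.

```python
import math

def path_metrics(points, start, target):
    """Summarise one pointer path relative to the task axis.

    points: list of (x, y) pointer samples in pixels (assumed log format).
    start, target: (x, y) centres that define the task axis.
    """
    ax, ay = target[0] - start[0], target[1] - start[1]
    length = math.hypot(ax, ay)
    if length == 0:
        raise ValueError("start and target must differ")

    # Signed perpendicular distance of every sample from the task axis.
    devs = [(ax * (py - start[1]) - ay * (px - start[0])) / length
            for px, py in points]
    n = len(devs)

    offset = sum(devs) / n                    # movement offset: signed mean
    error = sum(abs(d) for d in devs) / n     # movement error: mean absolute deviation
    variability = (math.sqrt(sum((d - offset) ** 2 for d in devs) / (n - 1))
                   if n > 1 else 0.0)         # movement variability

    # Task axis crossings: sign changes between consecutive non-zero deviations.
    signs = [d > 0 for d in devs if d != 0]
    crossings = sum(1 for a, b in zip(signs, signs[1:]) if a != b)

    return {
        "movement_offset": offset,
        "movement_error": error,
        "movement_variability": variability,
        "task_axis_crossings": crossings,
    }
```

A path that weaves evenly around the axis gives a movement offset near zero but a non-zero movement error, which is the distinction the two measures are intended to capture.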
The ISO9241-9 multi-directional tapping task was used to verify these metrics with 16 circular targets, each 30 
pixels in diameter and placed around a 400 pixel diameter outer circle (MacKenzie et al., 2001). These seven 
metrics, as well as throughput, movement time and missed clicks were used in a study to determine the 
difference in cursor movement for motor-impaired users (Keates et al., 2002).  
A further six metrics, which could assist in determining why a difference exists, were specifically designed for 
use with disabled users and were proposed by Keates et al. (2002). These measures will not be used during this 
study as they are not considered relevant. An additional metric measuring the number of clicks outside the 
target is also suggested in order to measure the performance of pointing devices (Keates et al., 2002).  
Additional measurements will be analysed in an effort to explain the difference in performance if such a 
difference exists between the interaction techniques. These additional measurements will either be some of 
the afore-mentioned measurements or they will be derived from these measurements. Therefore, the total 
task completion time will be measured as well as the task completion time from when the target is highlighted 
to when it is clicked, the number of target re-entries, the number of incorrect targets which are acquired 
during task completion and the number of incorrect clicks. This will allow efficiency and effectiveness to be 
tested. Furthermore, the ISO device assessment questionnaire, which is reproduced in its entirety in Appendix 
E (Questions 1-9), will be used to test satisfaction to a degree.  
 
3.4.2.2 Experimental design 
The ISO test requires that the size of the targets and the distance between targets be varied in order to 
measure the throughput. Therefore, variable size targets will be used, but in order to reduce the time required 
to complete a test the distance between targets will not be adjusted during this testing.  
The smaller icon on the Word ribbon is 24x24 pixels in size (visual angle ≈ 0.62°). This was therefore used as 
the base from which to start testing target selection with speech recognition and eye gaze. Miniotas et al. 
(2006) determined that the optimal size for targets when using speech recognition and eye gaze as a pointing 
device was 30 pixels. This was determined using a 17’’ monitor with a resolution of 1024x768. Participants 
were seated at a viewing distance of 70 cm. This translated into a viewing angle of 0.85°. The eye-tracker used 
in this study was a Tobii T120 with a 17’’ monitor where the resolution was set to 1280x1024. In order to 
replicate the viewing angle of 0.85° obtained by Miniotas et al. (2006), a 30 pixel target could be used but at a 
viewing distance of 60 cm from the screen. Therefore, the next size target to be tested in the trials was 
determined to be a 30x30 pixel button. It was decided to also test a larger target than that established by 
Miniotas et al. (2006). Following the example set by Miniotas et al. (2006) of testing target sizes in increments 
of 10 pixels, the final target size to be used was 40 pixels (visual angle ≈ 1.03°). 
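The relationship between target size, viewing distance and visual angle used above can be sketched as follows. This is a hedged Python illustration only: the function names are hypothetical, and the pixel pitch (physical size of one pixel) is an assumed input that depends on the physical dimensions and resolution of the monitor, so results may differ slightly from the rounded angles reported in the text.

```python
import math

def visual_angle_deg(size_mm, distance_mm):
    """Visual angle (degrees) subtended by an object of a given size
    viewed head-on from a given distance."""
    return math.degrees(2 * math.atan(size_mm / (2 * distance_mm)))

def target_visual_angle(size_px, pixel_pitch_mm, distance_mm):
    """Visual angle of an on-screen target.

    pixel_pitch_mm is an assumption supplied by the caller: the physical
    width of one pixel on the monitor in question.
    """
    return visual_angle_deg(size_px * pixel_pitch_mm, distance_mm)
```

Because the angle depends on the physical target size, the same pixel size subtends a smaller angle on a higher-resolution screen of equal dimensions, which is why the viewing distance has to be shortened to replicate a given visual angle.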
The multi-directional tapping task will have sixteen targets situated on a circle with a diameter of 800 pixels. 
The targets will be positioned on the edges of the circle – thereby creating an inner circle with diameter of 800 
pixels.  Square targets will be used and not circular targets as the buttons in the final application will be 
rectangular in shape. Therefore, it was decided that square targets will be more meaningful since they are also 
allowable under the ISO standard.  
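The layout described above could be generated along the following lines. This is a minimal sketch, assuming that the circle is centred on a 1280x1024 screen and that target centres are spaced evenly; the function name `circle_targets` and the default centre are hypothetical and are not taken from the test application.

```python
import math

def circle_targets(n=16, diameter=800, centre=(640, 512)):
    """Return the centre points of n targets spaced evenly on a circle.

    centre defaults to the middle of a 1280x1024 screen (an assumption).
    """
    radius = diameter / 2
    return [
        (centre[0] + radius * math.cos(2 * math.pi * i / n),
         centre[1] + radius * math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]
```

Each square target of 24, 30 or 40 pixels would then be drawn centred on one of these points.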
Target acquisition will either be via eye-tracking and speech recognition or the mouse. The mouse will be used 
to establish a baseline for selection speed. When using a verbal command to select a target, the subjects will 
have to say “go” out loud in order to select the target that they are looking at. This method of pointing can 
therefore be considered analogous to look-and-shoot. The word “go” was chosen as it was established during 
development that this was the word which was most accurately captured by the speech engine with minimal 
training. The words “select” and “click” will also be available as verbal commands. 
The literature review uncovered various shortcomings of using eye gaze for target selection, namely the 
instability of the eye gaze and the difficulties experienced in selecting small targets. In order to combat these 
shortcomings, a number of solutions have been proposed. These include magnification and the use of a 
gravitational well. Consequently, both of these techniques will be tested during this phase of the research 
study. The magnification settings were the same as for the Word application while the gravitational well was 
activated within a 50 pixel radius around each button. Another prerequisite of using eye gaze as a pointing 
device is that visual feedback is given at all times. Since the final application provides a choice between 
inverting the colour or framing the button which has focus, both these visual feedback mechanisms will be 
tested in order to establish whether they affect the performance of the pointing device.  
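The idea behind the gravitational well can be illustrated with a short sketch. It is not the implementation used in the application: the function name `snap_to_well` is hypothetical, and measuring the 50 pixel radius from the centre of each button is an assumption, since the text does not state the reference point.

```python
import math

def snap_to_well(gaze, button_centres, radius=50):
    """Pull the gaze point onto the nearest button centre within `radius`.

    gaze: (x, y) gaze coordinate in pixels.
    button_centres: iterable of (x, y) button centres.
    Returns the snapped point, or the unchanged gaze point when no
    button lies within the well radius.
    """
    best, best_distance = None, radius
    for centre in button_centres:
        distance = math.dist(gaze, centre)
        if distance <= best_distance:
            best, best_distance = centre, distance
    return best if best is not None else gaze
```

Snapping to the nearest centre means that small fluctuations of the gaze around a button no longer move the pointer off the target, which addresses the instability noted in the literature.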
Therefore, there were essentially three varying conditions which could be combined according to the matrix as 
depicted below (Table 3.3). Since the mouse is the benchmark against which the alternative means of pointing 
and selecting must be evaluated, it was not deemed necessary to have a gravitational well with the mouse at 
any point. Secondly, the fact that a faster means of mouse selection was not under inspection meant that only 
the traditional means of mouse selecting had to be measured. 
Table 3.3: Matrix of test conditions for ISO testing 
| | Framed visual feedback | Inverted visual feedback |
|---|---|---|
| Gravitational well | Eye-tracking and speech recognition | Eye-tracking and speech recognition |
| No gravitational well | Mouse; eye-tracking and speech recognition | Mouse; eye-tracking and speech recognition |
 
Additional trials will also be included using magnification on the 24 pixel targets to determine whether 
magnification alone can allow users to achieve comparable speeds with the standard size icons. For this 
reason, magnification was not combined with the gravitational well and was not tested with both visual 
feedback techniques. This resulted in a total of fourteen trials per session (Table 3.4). The number of trials 
served as motivation for not adding more mouse trials, as this would simply prolong the session and might 
cause participants to become irritable and fatigued, which could influence the results. Since all participants 
had to be proficient with the mouse, two trials with the mouse were considered sufficient to obtain an 
accurate throughput for the mouse. 
Table 3.4: Multi-directional tapping trials 
| Group | Pointing device | Target size (pixels) | Visual feedback | Target magnification | Gravitational well |
|---|---|---|---|---|---|
| M(F) | Mouse | 30 | Framed | No | No |
| M(I) | Mouse | 24 | Inverted | No | No |
| MM | Mouse | 24 | Inverted | Yes | No |
| ETS(F) | Eye gaze and speech | 30 | Framed | No | No |
| ETS(F) | Eye gaze and speech | 40 | Framed | No | No |
| ETS(I) | Eye gaze and speech | 30 | Inverted | No | No |
| ETS(I) | Eye gaze and speech | 40 | Inverted | No | No |
| ETSG(I) | Eye gaze and speech | 30 | Inverted | No | Yes |
| ETSG(I) | Eye gaze and speech | 40 | Inverted | No | Yes |
| ETSG(F) | Eye gaze and speech | 30 | Framed | No | Yes |
| ETSG(F) | Eye gaze and speech | 40 | Framed | No | Yes |
| ETSM | Eye gaze and speech | 24 | Inverted | Yes | No |
| ETSM | Eye gaze and speech | 30 | Inverted | Yes | No |
| ETSM | Eye gaze and speech | 40 | Inverted | Yes | No |
 
Three sessions will be conducted in which all 14 trials will have to be completed by all participants. The first 
session of the study will be preceded by a pre-test questionnaire (Appendix D) which will capture user 
demographics and other information pertinent to the study. The ISO device assessment questionnaire, which is 
contained in full in Appendix E, will be administered at the end of testing to measure subjective opinion of the 
pointing device. 
 
When using a repeated measures design the dangers of asymmetric skill transfer are heightened. Asymmetric 
skill transfer or learning effects are often encountered due to the order in which the tasks or treatments are 
presented to the subject (Poulton & Freeman, 1966). The best way to counterbalance these learning effects is 
through the use of a balanced Latin square (Bradley, 1958; Reese, 1997).  By varying the interaction techniques 
using a Latin square, a measure of control will also be imposed upon the results, thereby lending further 
credibility to the results. Therefore, a balanced Latin square for all trial conditions was obtained by following 
the instructions provided by Edwards (1951). Participants will be randomly assigned to a Latin square condition 
for each session.  
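For an even number of conditions, a balanced Latin square can be generated with the commonly used construction sketched below. This is offered as an illustration only: it is an assumption that this construction yields a square equivalent to the one obtained from the instructions of Edwards (1951), and the function name `balanced_latin_square` is hypothetical.

```python
def balanced_latin_square(n):
    """Return an n x n balanced Latin square for an even number of conditions.

    Conditions are numbered 0..n-1. Each condition appears once in every
    row and column, and every condition immediately follows every other
    condition exactly once across the rows.
    """
    if n % 2:
        raise ValueError("this construction requires an even number of conditions")

    # First row interleaves low and high condition numbers: 0, 1, n-1, 2, n-2, ...
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1

    # Every further row shifts each entry of the first row by one (modulo n).
    return [[(c + r) % n for c in first] for r in range(n)]
```

With the fourteen trial conditions of Table 3.4, `balanced_latin_square(14)` gives fourteen presentation orders, and each participant would be assigned one row per session.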
The target button which must be clicked will be denoted by an “X”. An example of one of the trials using 
inverted colour feedback, since the eye gaze is currently focused on the target button, is depicted below: 
 
Figure 3.10: Multi-directional tapping task using eye gaze and speech with target button currently having focus 
 
3.4.3 Word processor functions and text entry 
The word processor functions of navigation, editing and formatting of text as well as text entry will be tested 
together during user testing.  
 
3.4.3.1 Assessment of word processor functions 
Standard usability measures should suffice to test the usability of the word processing functions of navigating, 
formatting and manipulating text in a word processor. A number of usability measures are advocated as a 
means to measure usability of a software application (cf. Bohmann, 2000; Faulkner, 2000; Nielsen, 2001a; 
Nielsen, 2001b; Preece et al., 1994). Usability models which consolidate numerous measurements of usability 
have been proposed and tested by various authors (cf. Abran, Suryn, Khelifi, Rilling & Seffah, 2003; Dix, Finlay, 
Abowd & Beale, 1993). These were also considered for inclusion in the study. It was however found that the 
vast majority of the proposed measurements were either not applicable or were very similar to the five 
measurable objectives proposed by Shneiderman (1998). Therefore, these usability measures were considered 
an acceptable foundation from which usability measures could be extracted in order to measure the 
aforementioned identified components of usability, namely effectiveness, efficiency and satisfaction as well as 
the additional component of learnability. These measurable objectives are as follows (Shneiderman, 1998): 
• Time to learn – How long it takes users to learn the commands or actions necessary to complete the 
task.  
• Retention over time – The extent to which users retain the knowledge they have gained.  
• Speed of performance – Time taken to complete the task. 
• Rate of errors by users – The number of errors made by the user in an attempt to carry out the task. 
• Subjective satisfaction – The level of satisfaction, or how much they enjoyed working with the system 
or parts thereof. 
Time to learn and retention over time are learnability factors. This portion of the research study will make use 
of longitudinal testing, which requires that the same tasks are completed during a number of sessions. This will 
allow the learnability of the interaction technique to be tested. The speed of performance or time to complete 
the task will be used as a measure of efficiency, as will the additional measurement of the number of actions 
required to complete the tasks. Effectiveness will be analysed in terms of the correctness with which the task 
could be completed as opposed to the number of errors made. Furthermore, subjective satisfaction will be 
tested through the use of a questionnaire. 
 
3.4.3.2  Assessment of text entry 
Measuring the efficiency and effectiveness of text entry requires other measurements to be analysed. The 
underlying concept of testing text entry is to present some text which must be entered using the interaction 
technique which is to be tested. The resultant text which is then entered, often referred to as the transcribed 
text, is then compared to the presented text in order to determine how different they are. The minimum string 
distance (Levenshtein, 1965) or so-called Levenshtein distance can be used for this purpose. This distance is 
calculated as the minimum number of corrections which must be made in order to transform one string into 
another (Wobbrock, 2007). The operations which can be used to transform the strings are the insertion of a 
character (i), the deletion of a character (d) and substituting one character for another (s). 
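The minimum string distance described above can be computed with the standard dynamic programming approach. The following is a minimal Python sketch, not the code used in this study; the function and parameter names are illustrative.

```python
def levenshtein(presented, transcribed):
    """Minimum number of insertions, deletions and substitutions needed
    to transform the presented text into the transcribed text."""
    # previous[j] holds the distance between the processed prefix of
    # `presented` and the first j characters of `transcribed`.
    previous = list(range(len(transcribed) + 1))
    for i, p in enumerate(presented, start=1):
        current = [i]
        for j, t in enumerate(transcribed, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (p != t),  # substitution, or no cost on a match
            ))
        previous = current
    return previous[-1]
```

For example, transforming "kitten" into "sitting" requires two substitutions and one insertion, giving a distance of 3.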
This Levenshtein distance can then be used to determine the effectiveness measurements of character error 
rate (CER) and percentage correctness measures (Read et al., 2001).  The Levenshtein distance is divided by 
the number of characters to obtain the CER (Equation 1). 

  = 		 ∗ 100 (1) 

The character error rate is a negative effectiveness measurement which can then be transformed into the 
positive measurement percentage correctness measurement, denoted by PCM (Read et al., 2001): 
\[ \mathrm{PCM} = 100 - \mathrm{CER} \tag{2} \]
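The two effectiveness measures can be expressed as small helper functions. This sketch is illustrative only: the function names are hypothetical, and the character count used as the divisor is passed in by the caller because the text does not state whether the presented or the transcribed text is counted.

```python
def character_error_rate(distance, n_characters):
    """Character error rate as a percentage: the Levenshtein distance
    divided by the number of characters, multiplied by 100."""
    if n_characters <= 0:
        raise ValueError("n_characters must be positive")
    return distance / n_characters * 100

def percentage_correctness(cer):
    """Percentage correctness measure, the complement of the CER."""
    return 100 - cer
```

A phrase of 20 characters entered with a Levenshtein distance of 2 would, for instance, give a CER of 10% and a PCM of 90%.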
The number of characters typed per second (Equation 3) has been successfully used as a measure of efficiency 
for word processing text entry methods (Read et al., 2001).  
\[ \text{Characters per second} = \frac{\text{Number of characters}}{\text{Time in seconds}} \tag{3} \]
For an accurate measure of characters per second, the number of characters should be measured as the 
number of characters in text input – 1 (MacKenzie, 2002). This is to compensate for the fact that the 
preparation time for the first character cannot be accurately measured. Therefore, by discarding a single 
character the time is measured as starting from the preparation time for the second character until the last 
character is input.  
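The adjustment of discarding the first character can be sketched as follows. This is a hedged illustration with a hypothetical function name; it assumes that the elapsed time is measured from the entry of the first character to the entry of the last character.

```python
def characters_per_second(transcribed, seconds):
    """Entry rate in characters per second.

    One character is subtracted because the preparation time for the
    first character is not captured in the measured interval.
    seconds: time from the first to the last character entered (assumed).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (len(transcribed) - 1) / seconds
```

A 19-character phrase entered in 9 seconds would therefore be counted as 18 characters, giving 2 characters per second.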
Other text entry measurements include keystrokes per character (Soukoreff & MacKenzie, 2001), gestures per 
second (Wobbrock, 2007) and corrected, uncorrected and total errors as well as efficiency of error correction 
(Soukoreff & MacKenzie, 2003). Apart from these there are a number of other measurements which are 
available for use but were not deemed applicable to this study. 
In order to test the usability of the proposed text entry method, only one measurement per usability 
component was considered, namely characters per second as an efficiency measurement and character error 
rate as an effectiveness measurement. User demographics (Appendix F) and the satisfaction experienced by 
users will be measured through the use of a questionnaire (Appendices G and H) and learnability will be 
monitored as the progress that is made over a number of sessions. 
3.4.3.3 Experimental design 
Each participant will be required to complete a task list comprising representative tasks in a word processor 
environment (Table 3.5).  
Table 3.5: Word processor functions and text entry testing task list 
| Task no | Task text | Task type | Skill being tested |
|---|---|---|---|
| 1 | Underline the first three lines of text using speech recognition. | Line selection and formatting | Selection; Formatting |
| 2 | Italicise the last three lines of text using the keyboard. | Line selection and formatting | Navigation; Selection and formatting |
| 3 | Use speech recognition to select all the text in the document and delete it. | Select all text and remove | Selection; Editing/Manipulation |
| 4 | Enter the following phrase using eye gaze and speech recognition: <randomly selected phrase> | Typing | Typing |
| 5 | Use speech recognition to select the first two words of the sentence and make them bold. | Select words and format | Navigation; Selection; Formatting |
| 6 | Use the keyboard to select the whole paragraph and then to cut it. | Select all text and remove | Selection; Editing/Manipulation |
| 7 | Type the following phrase using the keyboard: <randomly selected phrase> | Typing | Typing |
| 8 | Use the keyboard to select the first two words in the document and then to make them bold. | Select words and format | Navigation; Selection; Formatting |
| 9 | Use speech recognition to italicise all the text. | Select all and format | Selection; Formatting |
| 10 | Enter the following phrase using eye gaze and speech recognition: <randomly selected phrase> | Typing | Typing |
| 11 | Paste the previously cut text using the keyboard. | Paste | Editing/Manipulation |
| 12 | Undo your previous action using speech recognition. | Undo | Editing/Manipulation |
| 13 | Paste the previously cut text using speech recognition. | Paste | Editing/Manipulation |
| 14 | Undo your previous action using the keyboard. | Undo | Editing/Manipulation |
| 15 | Type the following phrase using the keyboard: <randomly selected phrase> | Typing | Typing |
| 16 | Use speech recognition to select the last word and to copy it. | Select word and copy | Navigation; Selection; Editing/Manipulation |
| 17 | Use the keyboard to insert the copied word after the second word. | Position and paste | Navigation; Editing/Manipulation |
| 18 | Enter the following phrase using eye gaze and speech recognition: <randomly selected phrase> | Typing | Typing |
| 19 | Use the keyboard to select the last word and to copy it. | Select word and copy | Navigation; Editing/Manipulation |
| 20 | Use speech recognition to insert the copied word after the second word. | Position and paste | Navigation; Editing/Manipulation |
 
The task list was compiled in such a way as to include elements of all of the required functions. Each task will 
also specify the interaction technique which must be used to complete the task. In order to perform 
meaningful comparative analysis between the traditional methods of input and the proposed interaction 
technique, similar tasks will have to be performed using both these interaction techniques. For example, the 
participant will be required to position the cursor correctly, select some text and then copy it using either the 
keyboard or the mouse. A similar task will then have to be performed using speech commands. A window 
containing the task instruction will be overlaid on the top-right hand corner of the Word window. 
There are also a number of text entry tasks, some of which must be completed using the keyboard and others 
using speech recognition and eye gaze. For each of these tasks, the sentence which must be input will be 
randomly chosen from a set of 35 pre-selected phrases. These phrases were chosen from the 500 phrases 
determined by MacKenzie and Soukoreff (2003) to be commonly used everyday phrases. The subset was 
selected based on the applicability of the phrases to the setting of the study as well as their length, character 
set and level of difficulty. Phrases with unusual or hard-to-spell words were omitted from the list as these did 
not conform to the aim of the study. Some phrases were, however, included based on whether they contained 
double letters. In order to emulate the study conducted by Karl et al. (1993) which also tested text entry within 
a word processor and verbal commands to complete formatting, the phrases were also chosen for their 
memorability and familiarity so that participants could easily remember the phrase to be entered and would 
not have to continually refer back to a hard copy or, in this case, the instructional window. 
The phrase set which was chosen was: 
• Time to go shopping • The daring young man 
• Elephants are afraid of mice • Prepare for the exam in advance 
• You must be getting old • A dog is the best friend of a man 
• I agree with you • That is a very odd question 
• Take a coffee break • Rapidly running short on words 
• Fish are jumping • Dolphins leap high out of the water 
• I am wearing a tie and a jacket • Nothing finer than discovering a treasure 
• All together in one big pile • The location of the crime 
• Goldilocks and the three bears • Luckily my wallet was found 
• My favourite web browser • They watched the entire movie 
• Have a good weekend • Sit at the front of the bus 
• This is a very good idea • The elevator door appears to be stuck 
• User friendly interface • With each step forward 
• It is very windy today • Wishful thinking is fine 
• Zero in on the facts • What goes up must come down 
• Universities are too expensive • Insurance is important for bad drivers 
• A picture is worth many words • Tell a lie and your nose will grow 
• The dog buried the bone 
 
Analysis of the phrase set reveals the descriptive statistics tabulated in Table 3.6. The five most frequently 
occurring letters are summarised in Table 3.7. 
The letters “E” and “T” are the first and second most frequently used letters in English text, while “I”, “A” and 
“O” are in the group of third most frequently used letters in English text (Oxford Dictionary, 2011). When 
analysing letter frequency in main dictionary entries rather than printed text, the letter “E” is the most 
frequently used letter, “A” the second, “I” the fourth, “O” the fifth and “T” the sixth (Oxford Dictionary, 2011). 
Therefore, the phrase set selected for use closely resembles the English language in terms of letter frequency. 
 Table 3.6: Descriptive statistics for phrase set 
Descriptive statistic Measurement 
Number of phrases 35 
Minimum characters in phrase (excluding spaces) 13 
Maximum characters in phrase (excluding spaces) 36 
Minimum characters in phrase (including spaces) 20 
Maximum characters in phrase (including spaces) 41 
Average ± Standard Deviation of characters in phrase (excluding spaces) 22.29±5.27 
Average ± Standard Deviation characters in phrase (including spaces) 26.6±6.16 
Number of words 186 
Number of unique words 100 
Minimum word length 1 
Maximum word length 12 
 
Table 3.7: Frequencies with which letters occur in selected phrase set 
Letter Frequency 
E 72 
T 58 
I 52 
A 48 
O 47 
 
The most frequently occurring words are: 
Table 3.8: Most frequently occurring words in selected phrase set 
Word Frequency 
the 14 
a 11 
is 7 
of 5 
in 3 
are 3 
and 3 
very 3 
 
The words “the”, “a”, “is”, “of”, “in” and “and” are in the top 7 most commonly used words in English, while 
the word “are” is the fifteenth most commonly used word and “very” the 127th most commonly used English 
word (Fry, Kress & Fountoukidis, 1993; word-english, 2003). Therefore, the most frequently occurring words 
in the phrase set used are also some of the most commonly used words in the English language. 
As suggested by MacKenzie and Soukoreff (2003), no capitalisation or punctuation will be expected when 
entering the phrase sets. 
In order to gauge learnability of the application using the new interaction techniques, the participants will be 
required to complete 10 sessions so that their progress can be measured. Each participant will have to attend 
one session per week. During the first session, participants will complete a pre-test questionnaire, designed to 
elicit the participant’s expertise with a word processor. Since expertise is a measurement of both the 
frequency and length of use (Rosson, 1984), the questions will be phrased so as to gauge both of these aspects 
of the participant’s expertise. A number of other demographics will also be captured through this 
questionnaire. The complete questionnaire can be seen in Appendix F. During subsequent sessions, the 
participants will have to complete the task list as set out in Table 3.5 to the best of their ability. A post-test 
questionnaire (Appendix H) will be administered during the final session to gauge user satisfaction after 
prolonged use of the application. 
 
3.5 Statistical analysis 
Inferential statistics will be used to analyse the data and investigate the stated hypotheses. The notation H0 
will be used to denote the null hypothesis or the hypothesis of no difference. Where there are multiple null 
hypotheses under investigation, the notation H0,i will be used to denote the i-th null hypothesis. For example, 
H0,1 is the first null hypothesis and H0,2 is the second null hypothesis. 
Since most of the data captured will follow a within-subjects experimental design, the data will be in the form 
of repeated measures, where a number of measures are taken for each participant over a number of sessions, 
for the same condition. In order to determine whether there is a significant difference between measures 
taken over a number of sessions with the same participants, a repeated measures analysis must be used. 
Normality tests will be performed on the data in order to verify whether the data is normally distributed or 
not. If the data is normal a suitable parametric test will be used, otherwise an equivalent non-parametric test 
will be used.  
Since the same participants will be tested multiple times, paired tests can be used for analysis. Where there 
are only two dependent variables, a paired t-test can be used. The non-parametric equivalent of a paired t-test 
is the sign test (Whitley and Ball, 2002) or the Wilcoxon test (Motulsky, 1995). If there are multiple 
independent variables or more than two dependent variables then a within-subjects repeated measures 
ANOVA can be used.  
Repeated measures ANOVA assumes normality and sphericity of data (Minke, 1997). A non-parametric 
alternative to the repeated measures ANOVA is the Friedman test (Motulsky, 1995). However, since the 
ANOVA is robust to violations of normality, it will be used in all instances regardless of the distribution of the 
data. Mauchly’s sphericity test will be used to verify whether the assumption of sphericity is met before 
analysis commences. Sphericity can be compared to the homogeneity of variance in the between-groups 
ANOVA (Field, 1998). When the assumption of sphericity is not met, there are a number of corrections which 
can be applied to the degrees of freedom, such as the Greenhouse-Geisser correction, the Huynh-Feldt 
correction and the Lower Bound correction (StatSoft, 2010). The closer the Greenhouse-Geisser estimate is to 
1 the more homogeneous the variances of differences are and the more spherical the data is (Field, 1998). For 
a Greenhouse-Geisser estimate larger than 0.75, the less conservative Huynh-Feldt adjusted correction should 
be applied (Girden, 1992; Nimon & Williams, 2009). If the assumption of sphericity is not met, the F ratio is 
positively biased, which increases the chances of falsely rejecting the null hypothesis (Maxwell & Delaney, 
2004). Some texts advocate reporting the results of both the univariate and multivariate approaches (Minke, 
1997), while others advise using the multivariate approach wherever possible (StatSoft, 2010) since these 
tests are not dependent on the assumption of sphericity (Field, 1998). Therefore, when adjusted corrections 
are required, the results of the multivariate test will also be reported for the sake of completeness and to 
ensure that the results are not compromised by the lack of sphericity. 
Line graphs will be used to graphically illustrate the measures over time and for comparison purposes. Where line 
graphs are used, the vertical bars will denote a 95% confidence interval. The confidence interval is calculated 
as follows (StatSoft, 2010): 

&'()*+*	(,*-./0 =	x̄ ± tα/2( ) √
When significant differences are detected, post-hoc tests will be conducted to determine which factors were 
the cause of the significant difference. Tukey’s HSD test will be used for post-hoc tests. 
Where the data is summarised in a frequency table, the Chi-Square test will be used to test whether two 
variables are independent of one another (StatSoft, 2010). 
Before the commencement of any statistical analysis, outliers will be removed from the data. A data point will 
be determined to be an outlier if one of the following conditions holds (StatSoft, 2010): 
1. data point value > UBV + o.c.*(UBV - LBV) 
2. data point value < LBV - o.c.*(UBV - LBV) 
where UBV is the upper bound value or 75th percentile, LBV is the lower bound value or 25th percentile and 
o.c. is the outlier coefficient, which is set to 1.5.  
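The outlier rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure of the statistical package: the function name `remove_outliers` is hypothetical, and the percentiles are computed with the "inclusive" method of the standard library, which may differ slightly from the percentile definition used by StatSoft.

```python
import statistics

def remove_outliers(values, coefficient=1.5):
    """Drop values lying more than coefficient * (UBV - LBV) beyond the
    75th percentile (UBV) or below the 25th percentile (LBV)."""
    # quantiles(..., n=4) returns the three quartile cut points.
    lbv, _, ubv = statistics.quantiles(values, n=4, method="inclusive")
    spread = ubv - lbv
    low = lbv - coefficient * spread
    high = ubv + coefficient * spread
    return [v for v in values if low <= v <= high]
```

With the coefficient set to 1.5, a single extreme task completion time in an otherwise tightly grouped sample would be excluded before the inferential tests are run.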
 
3.6 Summary 
The development of a customisable, highly inclusive multimodal interface for a mainstream word processor 
was discussed in this chapter. The development of such an application, together with all the functionality 
which was provided, was discussed in detail. The resulting application offers a multimodal interface which can 
be adjusted to meet the needs and circumstances of a wide group of users. However, the usability of the 
proposed interaction techniques must still be established. To this end, the experimental methodology which 
will be followed in order to answer the remaining research questions was also elaborated upon. The settings 
for the user testing as well as the tasks which must be completed were discussed together with the procedure 
which will be followed during user testing. The following chapter will discuss the results of the first of these 
experiments, namely the feasibility testing which was conducted. 
 
  
 
 
CHAPTER 4 
FEASIBILITY TESTING OF THE MULTIMODAL INTERFACE 
 
4.1 Introduction 
In the previous chapter the development of the multimodal interface for Word was discussed. Using a variety 
of tools, eye gaze and speech were incorporated into the popular word processor application. It was found 
that in this way it was possible to create a highly customisable, hands-free multimodal interface for a 
mainstream application.  
Since the multimodal interface had successfully been included in Microsoft Word, the next step was to 
determine if the interface was a feasible solution. A feasibility test is aimed at determining whether the 
proposed interface is viable and whether it could offer a potentially usable interface to any users. Therefore, 
in contrast to a more formal usability study, it does not require that objective measurements be captured and 
analysed statistically. A number of Computer Science lecturers, who were familiar and comfortable with the 
field of Human-Computer Interaction (HCI), were approached to complete a questionnaire designed to elicit 
their reaction to, and assessment of, the proposed system. This chapter reports on the results of this feasibility 
testing. 
 
4.2 Participants 
It was established in Chapter 3 that five participants were sufficient for such a study. The five participants 
included in this study all had extensive experience in the field of HCI. These particular participants were 
targeted for inclusion in this sample based on the fact that they had the experience and foresight which is 
required to objectively judge the long-term viability of an application. Since the aim of this phase of the study 
was not to test the actual usability of the application but rather to determine whether such an application has 
potential, sampling was performed under this premise. Two of the participants specialised in HCI research, one 
in e-learning, one in web programming and security and the other in general computer programming. 
Moreover, all had experience in the field of HCI and were comfortable with the terminology and principles of 
this field. Two of the participants were female and the average age of the participants was 33.8 (standard 
deviation = 6.6). 
 
4.3 Tasks 
Participants were required to complete the pre-test questionnaire as contained in Appendix A. This made it 
possible to determine if they did in fact fall into the required target group. Once this was verified, participants 
were given a short demonstration on the use of the application and the various customisable features which 
were available. The command list in Table 3.1 was provided to them and Appendix B offered some suggestions 
for them to familiarise themselves with the application. They were then allowed to interact freely with the 
application, encouraged to explore the various options and to make full use of the functionalities offered by 
the application. Thereafter they all completed the post-test questionnaire in Appendix C. 
 
4.4 Limitations 
One limitation of this study is the small sample size, which has the consequence that statistical analysis could 
not be done on the results of the questionnaire. Therefore, the results will be reported on in an anecdotal 
manner which precludes the possibility of generalising to the population. Additionally, the fact that all 
participants were involved in HCI research could mean that responses to the questionnaire are biased. Should 
a different sample be approached then the results may differ substantially. Furthermore, the questions posed 
are very subjective in nature, which also means that the responses may not be generalisable and may be biased.  
However, since the purpose of this part of the study was to set the stage for the larger study and to provide 
substantiation for the study these limitations were not of high consequence.  
 
4.5 Results 
Results of the questionnaires will not be statistically analysed since the sample size is too small for meaningful 
analysis. Instead the responses will be inspected and reported on in an anecdotal manner.  
Four of the respondents were initially excited by the system, predominantly due to the possibilities it offered 
to disabled users in particular. Interaction with the application did not change the viewpoint of any of the 
participants – including the single respondent who was sceptical about the use of such a system. The main 
concern of this participant was the lack of control one has over one’s eye gaze. Unintentional and natural 
movements of eye gaze are hard to suppress and do cause a dilemma for researchers. However, there are 
possibilities of overcoming such shortcomings, such as smoothing and stabilisation algorithms which can be 
applied to the eye gaze response. The Midas touch problem can also be neutralised to a degree through the 
correct adjustment of dwell time settings or use of mechanical activation. Therefore, the main concern of the 
sceptic can be countered.  
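
As an illustration of the kind of smoothing referred to above, the following sketch applies a simple moving average to incoming gaze samples. The class name GazeSmoother and the window size are hypothetical. This is only one of many possible stabilisation approaches and is not necessarily the one used in the application.

```python
from collections import deque


class GazeSmoother:
    """Moving-average filter over the most recent gaze samples."""

    def __init__(self, window=5):
        # deque(maxlen=...) silently discards the oldest sample when full.
        self.samples = deque(maxlen=window)

    def update(self, x, y):
        """Add a raw gaze sample and return the smoothed (x, y) position."""
        self.samples.append((x, y))
        n = len(self.samples)
        mean_x = sum(p[0] for p in self.samples) / n
        mean_y = sum(p[1] for p in self.samples) / n
        return (mean_x, mean_y)


smoother = GazeSmoother(window=3)
for raw in [(100, 100), (104, 98), (97, 103), (101, 99)]:
    print(smoother.update(*raw))
```

A larger window gives a steadier pointer at the cost of a slower response to intentional eye movements.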
All participants agreed that the time has come for a paradigm shift in user interface design and that a 
multimodal interface may be the way of the future. To this end, the combination of eye gaze and speech was 
met with enthusiasm and optimism although concern was voiced that some practice would be needed to 
become accustomed to the interface. Since most new systems require some level of training and practice in 
order to master, this should not be considered a problem unique to the proposed system. Additionally, the 
naturalness and intuitive means of interaction could prove more of an advantage to the application although 
there are of course the inherent problems associated with the interaction techniques, such as the Midas 
touch, which have to be compensated for. 
Chart 4.1 shows the spread of the responses to a number of questions designed to gauge the subjective 
feelings of the participants towards the system. 
In most instances the response to the combination of eye gaze and speech and to the provision of a 
multimodal interface in a mainstream application are positive. However, the majority of the respondents felt 
that they could not navigate through the document very easily but that with extended use and practice they 
could improve their speeds to a satisfactory level. 
Since there are various options available to facilitate interaction with eye gaze, namely dwell time, blinking, 
look-and-shoot as well as combining it with speech, participants were asked  to rank these interaction 
techniques according to their own preference and then also according to their usability within a word 
processor. One participant did not answer these questions. The highest (n = 3) preference was for look-and-
shoot, followed by the combination of eye gaze and speech, followed by blinking and finally dwell time. This is 
quite understandable as look-and-shoot and the combination of eye gaze and speech probably offer the 
highest feeling of control over the system. For instance, blinking is a natural occurrence and is difficult to 
control. While the system does require a more pronounced blink to be executed before it responds, this could 
still lead to a feeling that blinking is not an allowable action. This perception would presumably change as 
more practice with the system is gained. Similarly, dwell time might appear to place more strain on the eyes as 
it requires a stable gaze to be maintained on an object of interest for a specified time. 
In terms of usability, look-and-shoot was seen as the most usable of the interaction techniques, followed by 
the combination of eye gaze and speech. Dwell time and blinking tied for the third most usable interaction 
technique. 
In conclusion, participants see the value of such an application, most notably as a means of absorbing disabled 
users into the mainstream user group. All respondents were in agreement that a multimodal interface for 
Word is a desirable development and that eye gaze and speech offered a viable multimodal solution. Overall, 
look-and-shoot was viewed as the preferred method of interaction as well as the most usable interaction 
technique of the four. While this might be true, some users may not have the mobility to perform such an 
action and should they have a limited vocabulary the combination of eye gaze and speech could offer a more 
usable and quicker means of interaction than dwell time and blinking. 
Question Yes (%) No (%)
Do you think that speech recognition and eye tracking offer a more usable long term solution? 80 20
Do you think that practice with the verbal commands will increase your efficiency? 50 50
Did the verbal commands allow you to navigate easier than what you normally do? 20 80
Is the use of verbal commands beneficial for long term use? 60 40
Is the use of eye gaze as an interaction technique beneficial for long term use? 60 40
Are the options presented in this interface a viable solution for disabled users? 100 0
Do you like the idea that multimodality will provide more flexibility for mainstream users? 100 0
Do you think that the multimodal interface will assist mainstream users to work efficiently under varying conditions? 80 20
From the viewpoint of a mainstream user would you prefer that multimodal options were available in a package such as Microsoft Word? 80 20
As a long term solution do you think this combination is a viable option? 80 20
Is speech and eye gaze a viable option for multimodal interface? 100 0
 
Chart 4.1: Responses to questionnaire 
4.6 Conclusion 
This chapter reported on the responses of a number of lecturers, all of whom were comfortable 
with HCI research, to the proffered multimodal interface for Word. The overall reaction was a positive one, 
particularly in light of the possibilities it will offer to disabled users who will be able to interact with a 
mainstream application. Additionally, in the opinion of the participants, a multimodal solution of eye gaze and 
speech is possibly an acceptable solution for a diverse group of users and not only for disabled users. 
Since these results indicate optimistic subjective feelings towards the system, it now remains to be seen 
whether the objective usability goals can be met. The next chapter will report on the first of these, namely the 
testing of eye gaze and speech when used for pointing-and-clicking purposes. 
  
 
 
CHAPTER 5 
ANALYSIS OF EYE GAZE AND SPEECH TO SIMULATE A POINTING DEVICE 
 
5.1 Introduction 
The previous chapter reported on the feasibility review which was conducted. Overall, the reaction of the 
reviewers to the vision and voice word processor was positive. It must now be investigated as to whether this 
word processor will provide a suitable experience to end-users. The first step in this investigation is to 
determine whether the combination of eye gaze and speech can effectively be used to simulate a pointing 
device. Section 1.5 identified that one of the common means of interaction with a word processor was through 
the selection of icons and menus via the use of a pointing device. Furthermore, when using the onscreen 
keyboard, eye gaze and speech will be used to select keys, therefore serving the purpose of pointing and 
clicking. The usability of a pointing device can be established by using the International Organization for 
Standardization (ISO) pointing device test, ISO 9241-9, to compare interaction techniques.  
This chapter will report on the results of such an ISO test which was conducted using an eye-tracker and 
speech recognition as an alternative to a mouse. A number of trial conditions were arranged and a group of 
participants each completed a few sessions with each trial. The results will give an indication of the viability of 
eye gaze and speech recognition to simulate a pointing device. The standard ISO measurement of throughput 
was analysed. Thereafter, the time was analysed as a separate variable since the nature of some of the 
interaction techniques could negatively influence the throughput. Some other measurements, which were 
identified in previous chapters, were also analysed, namely the number of target re-entries, the number of 
incorrect target acquisitions, the number of incorrect clicks and the time to selection of the designated target. 
All these measurements will be defined and then analysed in terms of their differences between interaction 
techniques. The chapter provides an in-depth analysis of all these measurements in an attempt to scrutinize 
the viability of the proposed interaction techniques. 
 
5.2 Participants  
A convenience sample was used for this part of the study as participants were sourced from the student 
population of the University of the Free State. Participants were expected to be competent with the computer 
and mouse and therefore they were chosen based on their exposure to a computer. Consequently, the sample 
consisted of senior students (no first year students). 
Each participant completed three sessions and each session consisted of all fourteen trials, which will be 
discussed in the following section. For the first session, there were 20 participants. However, five of the 
participants did not return for the second or third sessions for various reasons. Therefore, in total there were 
15 participants who completed all three sessions and only the data of these fifteen participants was included 
in the final analysis. 
Eleven of the participants were male and 4 were female. The average age of the participants was 22.3 
(standard deviation = 1.9). Analysis of the indicated computer expertise on the pre-test questionnaire 
(Appendix D) indicated that all participants could be ranked as having high computer expertise. Similarly, all 
participants ranked as having high mouse expertise. All participants indicated that they had neither eye-
tracking nor speech recognition experience. No previous eye-tracking or speech recognition experience was required as it 
was considered more desirable that participants did not have prior exposure to these technologies. Therefore, 
the sample was well set up with regard to computer and mouse expertise and lack of eye-tracking and speech 
recognition experience. 
 
5.3 Trials 
As discussed in Chapter 3 there were 14 trial conditions (Table 3.4). To recap, the trial conditions were as 
follows: 
1. Eye gaze and speech with no added features. 
2. Since the accuracy of eye gaze and speech could influence the ability of users to select small targets, 
magnification of the target can be achieved in two ways, namely: 
a. through “invisible” expansion of the target by using a gravitational well. This means that the 
selectable area of the button is larger than the size of the target as it is portrayed in the 
interface. When a stable eye gaze is detected within this larger area, the eye gaze is pulled 
onto the target, thereby creating a gravitational well; and 
b. through magnification of the area directly under the eye gaze. 
3. Visual feedback is essential in the use of eye-tracking as a pointing device; therefore different means 
of visual feedback were investigated. 
4. As the primary pointing device used in Word, the mouse was included for comparative purposes. 
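
The gravitational well described in item 2(a) can be sketched as follows. The sketch assumes square buttons and a fixed margin around each button. The names Target and snap_to_target, as well as the margin of 10 pixels, are hypothetical, and the detection of a stable gaze is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Target:
    x: float      # left edge in pixels
    y: float      # top edge in pixels
    size: float   # side length of the square button as drawn

    def centre(self):
        return (self.x + self.size / 2, self.y + self.size / 2)


def snap_to_target(gaze, targets, well_margin=10):
    """Pull a gaze point onto a target if it falls inside the enlarged area.

    The selectable area extends well_margin pixels beyond the drawn button.
    A gaze point outside every well is returned unchanged.
    """
    gx, gy = gaze
    for t in targets:
        inside_x = t.x - well_margin <= gx <= t.x + t.size + well_margin
        inside_y = t.y - well_margin <= gy <= t.y + t.size + well_margin
        if inside_x and inside_y:
            return t.centre()
    return gaze


buttons = [Target(x=100, y=100, size=30)]
print(snap_to_target((95, 120), buttons))   # just left of the drawn button
print(snap_to_target((50, 50), buttons))    # far from any button
```

In this sketch a gaze point slightly outside the drawn button is still treated as being on the button, which is why a click can appear to occur outside the visible target.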
While it is acknowledged that the interaction technique is essentially the mouse or eye gaze and speech, a 
distinction will also be made based on the visual feedback that was provided as it cannot be assumed that this 
did not affect user performance. Proper statistical analysis will be performed on the data before consideration 
will be given to disregarding the visual feedback used. 
Since the mouse is the interaction technique which is regarded as the benchmark, the condition of MM is not 
of importance to the scope of the study and will not be included in the analysis. 
 
5.4 Sessions 
Each session required that the participant complete all fourteen trials. Participants were randomly assigned a 
Latin square condition for each session. No participant was assigned to the same Latin square condition more 
than once. 
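
The assignment procedure can be sketched as follows. The text does not state how the Latin square was constructed, so a simple cyclic square is assumed here. The function names, the participant labels and the use of a random seed are hypothetical.

```python
import random


def cyclic_latin_square(n):
    """Build an n x n Latin square in which row r is the order r, r+1, ... (mod n)."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def assign_rows(participants, n_conditions=14, n_sessions=3, seed=None):
    """Give each participant a different row of the square for every session."""
    rng = random.Random(seed)
    square = cyclic_latin_square(n_conditions)
    assignment = {}
    for p in participants:
        # sample() draws without replacement, so no row repeats per participant.
        rows = rng.sample(range(n_conditions), n_sessions)
        assignment[p] = [square[r] for r in rows]
    return assignment


plan = assign_rows(["P01", "P02"], seed=42)
for participant, orders in plan.items():
    for session, order in enumerate(orders, start=1):
        print(participant, "session", session, order)
```

Each number in a row stands for one of the fourteen trial conditions, so every condition appears once per session and in a different position for each row of the square.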
The first session commenced with each participant giving informed consent to participate in the research 
study. Participants then completed the pre-test questionnaire (Appendix D). The purpose of the study was then 
explained and a quick overview of the procedure and requirements was given. The first and second sessions 
were, unfortunately, spaced 10 weeks apart as the laboratory was occupied between the sessions. The second 
and third sessions were two days apart. Admittedly, the uneven spacing between the sessions is not advisable. 
However, since the factor of interest was the interaction technique and not the learning effect over time it was 
decided that the period between session 1 and session 2 should not have an effect on the analysis or results of 
the interaction technique. At the end of the third session, each participant completed a post-test 
questionnaire (Appendix E) to gauge subjective reaction to the proposed interaction technique.  
Since the participants were sourced from the university, they were all fluent in either English or Afrikaans as 
these are the tuition languages of the university. Each session was conducted in the language that the 
participant was most comfortable with. The participants received an incentive, in the form of a gift voucher, 
for each session they completed.  
Some problems were experienced with the equipment during the second session of two participants. This 
resulted in no data being captured for some of the mouse tasks for these two participants. The M(F) task of 
two participants had to be discarded and the M(I) task for one of these participants also had to be excluded 
from the analysis. Therefore, their mouse throughput will be calculated from only two sessions. The throughput 
for the remainder of the trials, as well as for the rest of the participants, will be calculated using the data of all three sessions. 
 
5.5 Device movement  
Complete statistical analysis of the data must be performed, but first the data was inspected visually. For this 
purpose, the path of each trial was traced and drawn as an overlay over the trial setup. The images were only 
extracted for illustration purposes to serve as a visual representation of the trial completion. They also serve to 
give an idea of how much movement was required to complete the task. The set of images contained in Figure 
5.1 below show some paths that were traced as the participant completed a trial. The first image (a) is for the 
mouse, with framed feedback. The second (b) is for eye gaze and speech with no gravitational well, the third 
(c) is for a trial with a gravitational well. The fourth image (d) is for a magnification trial. The blue lines signify 
mouse movement and the red lines eye movement.  
The black circles indicate captured data points. It is important to note that not all data points are represented 
as it was not essential to capture the actual movement; therefore not all data points were saved, although 
they were reacted to in real-time during the course of the test. The numbers indicate the sequence of the 
clicks and are placed at the exact position where the click occurred. Note that when the gravitational well is 
activated, it is effectively possible that the click occurs outside of the button since the button is essentially 
larger than what is actually represented on the screen. Also take note that the buttons are redrawn as 
graphical squares for the purposes of this representation and that during the trial the buttons were displayed 
as standard Windows buttons. These squares are redrawn to be the exact size of the buttons during the trial 
and the decision to draw them as squares was purely to facilitate a simplified drawing process. 
The images in Figure 5.1 are visual representations generated for a participant who did not struggle to complete the trials. 
They represent the first session for this participant.  
    
Figure 5.1(a): Mouse path and (b) Eye-tracking (without gravitational well) path of a single participant 
 
    
Figure 5.1(c): Eye-tracking (with gravitational well) path and (d) Eye-tracking, with magnification, path of a single 
participant 
 
The following image set is for the same tasks as above but for another participant. This time it is for a 
participant who struggled more with the trials. These were also for the first session. 
    
Figure 5.2(a): Mouse path and (b) Eye-tracking (without gravitational well) path of a single participant 
 
    
Figure 5.2(c): Eye-tracking (with gravitational well) path and (d) Eye-tracking, with magnification, path of a single 
participant 
These images are representative of the other participants and there was no noticeable difference between the 
sessions. As can be seen from the figures, the amount of movement needed when using eye-tracking 
and speech recognition appears to be substantially more than when using a mouse. The magnification does 
not seem to lessen the effort required to click on all the buttons and, in the case of the second participant, 
actually seemed to hamper the selection more. The cleaner lines when using the gravitational well appear to 
almost rival those of the mouse, which would seem to indicate that the gravitational well may enhance the use 
of the eye-tracking as an interaction technique. These qualitative observations must still be statistically 
verified. 
 
5.6 Analysis of the throughput 
The output of the ISO tests is throughput, as calculated by the formula explained in Section 3.4.2.1. The 
throughput was calculated for each interaction technique per participant and per session. The first step in the 
analysis was to determine if M(F) and M(I) could be combined. Following this it will be determined if ETS(F) 
and ETS(I) can be combined into a single interaction technique. The same analysis will be conducted with the 
gravitational well interaction technique. Once the allowable combinations have been conducted, the final 
analysis will be performed on all the remaining interaction techniques. The subsequent section will discuss this 
analysis in depth.  
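
Since the formula of Section 3.4.2.1 is not repeated here, the following sketch assumes the commonly used effective-width formulation of ISO 9241-9 throughput, namely the effective index of difficulty divided by the mean movement time, with the effective width taken as 4.133 times the standard deviation of the selection offsets. The function name and the sample values are hypothetical and the exact calculation used in this study may differ.

```python
import math
import statistics


def throughput(distance, deviations, movement_times):
    """Throughput in bits per second for one target condition.

    distance       -- distance between targets in pixels
    deviations     -- signed selection offsets along the task axis in pixels
    movement_times -- time per selection in seconds
    """
    # Effective width reflects the spread of where selections actually landed.
    effective_width = 4.133 * statistics.stdev(deviations)
    # Effective index of difficulty in bits (Shannon formulation).
    effective_id = math.log2(distance / effective_width + 1)
    return effective_id / statistics.mean(movement_times)


# Made-up values for illustration only.
print(throughput(300, [-5, 0, 5], [1.1, 0.9, 1.0]))
```

Because accuracy enters through the effective width and speed through the movement time, a technique can only achieve a high throughput by being both fast and precise.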
 
5.6.1 Combining the interaction techniques 
Since the mouse was tested with both an inverted and a framed visual feedback cue, it was decided to first 
determine if this influenced the participant’s mouse throughput. Should this not be the case, then the trials for 
the mouse could be grouped together for further analysis. The reasoning behind this is that the mouse should 
be the device which remained fairly consistent throughout the trials. The initial setup of the trials did not allow 
the throughput to be calculated per mouse interaction technique using the ISO standard for each session. In 
retrospect, it would be advisable not to have changed the visual feedback cue since it could be argued that this 
could affect the throughput. Since the mouse was the benchmark interaction technique it was also not entirely 
necessary to provide different visual feedback cues but rather just to change the size of the targets and 
subsequently calculate throughput for the mouse using different target sizes. Since all participants were 
competent with the mouse, it was unlikely that they would “significantly learn” to use the mouse better over 
the three sessions, therefore it was decided that the throughput could be calculated using the data for the 
three sessions and then only distinguish between the mouse interaction techniques. 
A paired t-test was used to determine if there was a significant difference between throughput for the 
inverted colour feedback and the framed button feedback. The throughput for these two conditions was 
calculated over all sessions per participant. The following hypothesis was formulated: 
1. H0: There is no difference between the throughput achieved with the mouse when using inverted 
colour feedback or framed button feedback. 
The normality of the data was first verified using the Shapiro-Wilk normality test. Since the p-values for 
framed feedback (W = 0.97, p > 0.05) and inverted feedback (W = 0.95, p > 0.05) were larger than the α-value, 
it could be accepted that the data was normally distributed and a paired t-test could be used for analysis. 
The average throughput per participant was calculated over the three sessions for both M(F) and M(I). The 
mean for M(F) was 4.164 and that of M(I) was 4.170 and the standard deviation was 0.535 and 0.539 
respectively. Since p > 0.05, the null hypothesis cannot be rejected (t = 2.14, df = 14, p > 0.05).  Therefore, the 
conditions M(F) and M(I) can be combined into a single condition M.  
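
The calculation behind a paired t-test can be sketched as follows. The function name paired_t and the sample values are hypothetical and do not reproduce the data of this study. With 15 participants there are 14 degrees of freedom, for which the two-tailed critical value at an α-level of 0.05 is approximately 2.145.

```python
import math
import statistics


def paired_t(sample_a, sample_b):
    """Return the t statistic and degrees of freedom for paired samples."""
    # The test operates on the per-participant differences.
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    t = statistics.mean(diffs) / standard_error
    return t, n - 1


# Made-up throughput values for four participants, for illustration only.
framed = [4.1, 3.8, 4.6, 4.2]
inverted = [4.0, 3.9, 4.5, 4.3]
t, df = paired_t(framed, inverted)
print(t, df)
```

The null hypothesis is rejected only when the absolute value of t exceeds the critical value for the given degrees of freedom.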
The new groupings are summarised in the table below. If the mouse column is not ticked then the interaction 
technique is eye gaze and speech and if the framed column is not ticked, then the visual feedback is inverted. 
 Table 5.1: Grouped interaction techniques 
Group Mouse Pixel Size Framed Magnification Gravitational well 
M  30    
 24    
ETS(F)  30    
 40    
ETS(I)  30    
 40    
ETSG(I)  30    
 40    
ETSG(F)  30    
 40    
ETSM  24    
 30    
 40    
 
Since the throughput of the mouse was not affected by the type of visual feedback given, it was considered 
worthwhile to determine if the other interaction techniques were affected by the visual feedback. The average 
throughput for these interaction techniques is tabulated below: 
Table 5.2: Average throughput for all interaction techniques prior to consolidation 
 Session 1 Session 2 Session 3 
ETS(I) 0.633 0.752 0.851 
ETS(F) 0.643 0.839 1.027 
ETSG(I) 1.813 2.197 2.389 
ETSG(F) 1.883 1.994 2.378 
 
Chart 5.1 gives a visual representation of the data. 
Since the interaction techniques of ETS(I) and ETS(F) follow the same trend, it was considered worthwhile to 
determine whether these two interaction techniques differed significantly in terms of throughput. Since the 
only difference between them was the type of visual feedback, such an analysis could determine whether the 
type of visual feedback has an impact on the throughput levels achieved. If not, then the two interaction 
techniques could be amalgamated into a single interaction technique. The same logic applies to the interaction 
techniques of ETSG(I) and ETSG(F). 
In order to determine whether the interaction techniques of ETS(F) and ETS(I) could be combined into a single 
interaction technique, the following hypothesis was formulated: 
1. H0: There is no difference between the throughput of the different interaction techniques. 
 
 
 
Chart 5.1: Average throughput for all interaction techniques prior to consolidation 
Since each participant completed all three sessions, a repeated-measures ANOVA (Section 2.5) was required 
for analysis. Therefore, the normality and sphericity of the data had to be verified before analysis could 
commence. The Shapiro-Wilk test was used to test normality and it was found that the data was not normally 
distributed. The results of the Shapiro-Wilk test are summarised in Table 5.3. According to the Kolmogorov-Smirnov 
normality test all three sessions were normally distributed and since the ANOVA is robust to violation 
of normality (Section 3.5), it was decided to continue with the ANOVA analysis. Mauchly’s sphericity test 
confirmed that the assumption of sphericity was met (χ²(2) = 1.846, p > 0.05).  
 Table 5.3: Results of normality tests for ETS(F) and ETS(I) throughput 
Session 1 Session 2 Session 3 
W = 0.880, p < 0.05 W = 0.908, p < 0.05 W = 0.879, p < 0.05 
 
A within-subjects repeated measures ANOVA showed that the null hypothesis could not be rejected at an α-
level of 0.05 (F(1, 28) = 0.456, p > 0.05). Therefore, the type of visual feedback did not affect the throughput 
that could be achieved with the eye gaze and speech interaction technique. Consequently, the two interaction 
techniques could be consolidated into a single interaction technique which will be called ETS. 
The next step was to perform the same analysis for ETSG(F) and ETSG(I). In order to determine whether the 
feedback affected the throughput achieved when the gravitational well was present, the following hypothesis 
was formulated: 
1. H0: There is no difference between the throughput of the different interaction techniques. 
The normality of the data was confirmed using the Shapiro-Wilk test, the results of which are tabulated 
below. The data met the assumption of sphericity (χ²(2) = 2.375, p > 0.05) for the repeated-measures ANOVA. 
 Table 5.4: Results of normality tests for ETSG(F) and ETSG(I) 
Session 1 Session 2 Session 3 
W = 0.969, p > 0.05 W = 0.985, p > 0.05 W = 0.976, p > 0.05 
  
It was found that the null hypothesis cannot be rejected at an α-level of 0.05 (F(1, 28) = 0.185, p > 0.05). 
Therefore, the type of visual feedback does not affect the throughput of the interaction technique.  
As a result of not being able to reject H0, the interaction techniques of ETSG(I) and ETSG(F) can be considered 
to be one interaction technique. Subsequent analyses need not distinguish between them and all throughput 
measurements for these techniques will be combined and referred to as ETSG. The subsequent section will 
provide an in-depth discussion of this analysis. 
 
5.6.2 Analysing throughput 
In light of the findings in the previous section, the throughput was recalculated for each participant and for 
each session, taking into account that M(F) and M(I) as well as ETS(I) and ETS(F) were respectively combined 
as M and ETS and that ETSG(I) and ETSG(F) were now only ETSG. The following averages now apply to the 
interaction techniques: 
Table 5.5: Average throughput for the consolidated interaction techniques for all sessions 
 Session 1 Session 2 Session 3 
M 3.77 4.17 4.36 
ETS  0.52 0.67 0.82 
ETSG 1.68 1.90 2.17 
ETSM 0.48 0.48 0.61 
 
Chart 5.2 gives a graphical representation of the throughput for the interaction techniques over all three 
sessions. A 95% confidence interval is superimposed on each point. 
 
Chart 5.2: Average throughput for consolidated interaction techniques over all sessions 
 
Table 5.6 summarises the results of the normality tests for the data. The sphericity was confirmed (χ²(2) = 
0.387, p > 0.05) before analysis commenced in order to inspect the following hypotheses: 
1. H0,1: The interaction technique has no effect on the throughput achieved. 
2. H0,2: The session has no effect on the throughput achieved. 
The results of the Shapiro-Wilk normality test are tabulated below, with the Kolmogorov-Smirnov results also shown 
where the Shapiro-Wilk test failed to verify normality of the data. 
 Table 5.6: Results of the normality tests conducted on the throughput of all interaction techniques 
 Session 1 Session 2 Session 3 
Shapiro-Wilk W = 0.847, p < 0.05 W = 0.841, p < 0.05 W = 0.865, p < 0.05 
Kolmogorov-Smirnov d = 0.180, p < 0.05 d = 0.175, p > 0.05 d = 0.193, p < 0.05 
  
 
A within-subjects repeated measures ANOVA showed there was a significant interaction between the session 
and interaction technique (F(6, 108) = 2.598, p < 0.05); therefore separate analyses had to be conducted by 
isolating each factor in turn.  
H0,1 could be rejected at an α-level of 0.05 for all three sessions (Table 5.7). During all three sessions, it was 
only ETS and ETSM that did not differ significantly from each other. Since the mouse has, on average, the 
highest throughput it can be said that the mouse yields the best throughput of the tested interaction 
techniques. ETSG differed significantly from both ETS and ETSM and since, on average, ETSG has a higher 
throughput it implies that ETSG allows for faster, more accurate pointing than the other two interaction 
techniques. 
Table 5.7: Results of separate ANOVA on throughput for consolidated interaction techniques 
 Session 1 Session 2 Session 3 
ANOVA F(3, 56) = 325.024, p < 0.05 F(3, 54) = 309.927, p < 0.05 F(3, 56) = 255.637, p < 0.05 
 
H0,2 could also be rejected for all interaction techniques (Table 5.8), with the first session and second session 
differing significantly from the third session for ETS, ETSG and ETSM. When evaluating the mouse, the first 
session differed significantly from the second and third sessions. The expected average throughput rate of a 
mouse is between 3.5 and 4.5 bps (Soukoreff & MacKenzie, 2004). Therefore it could be said that the observed 
values correspond to the expected values, although they are slightly higher for the final two sessions. The fact 
that even the throughput of the mouse increased would suggest that some improvement could be attributed 
to a learning effect for the test and not the pointing device. The use of the Latin Square prevents any one
interaction technique from outperforming the others by virtue of its position in the test as opposed to its actual
usability. Therefore, if the learning effect were solely attributable to the users becoming accustomed to the test
and not the interaction technique, the level of improvement should be somewhat consistent for all interaction
techniques.
 Table 5.8: Results of separate ANOVA on throughput for sessions
         M                             ETS                          ETSG                          ETSM
ANOVA    F(2, 24) = 10.872, p < 0.05   F(2, 28) = 4.269, p < 0.05   F(2, 28) = 14.253, p < 0.05   F(2, 28) = 5.064, p < 0.05
 
Inspection of the significant differences indicates that the improvement was not uniform. Since the
level of improvement differs between the interaction techniques across the sessions, the
improvement cannot be attributed solely to familiarisation with the test; familiarisation with the interaction
techniques also contributed.
It could also be said that the unintentional long gap between session 1 and session 2 did not negatively impact 
performance. ETSG and ETS also appear to increase at a more rapid rate than ETSM, which signifies that they 
were easier to learn than ETSM. It would be interesting to increase the number of sessions in order to see 
whether a more prolonged exposure could eventually lead to a throughput which is comparable to that of the 
mouse for one of the interaction techniques or whether they will eventually plateau at a steady throughput 
after a few sessions.  
 
5.7 Analysis of the time 
The next measurement to be analysed was the time taken to complete the trial. Although throughput includes 
both speed and accuracy it seems prudent to analyse the time taken to complete the trials separately. This is 
especially important since some of the interaction techniques allow for larger “clickable” areas, which are not 
visible to the participant. This effectively means that the target can be selected without the eye gaze actually 
being positioned precisely on the button. This could negatively influence the throughput because of the 
measurement of the distribution of the click position. Consequently, the time taken for each interaction 
technique was calculated per session for each participant. This analysis should provide more insight into the 
usability of the different interaction techniques and serve as confirmation of the throughput results. An 
analysis was first performed to determine whether there was a possibility of combining interaction techniques, 
similar to that for the throughput. 
 
5.7.1 Combining the interaction techniques 
Similar interaction techniques were first isolated and analysed to determine whether the time for these could 
be combined. The following procedure was followed to determine if it was allowable to combine the 
interaction techniques: 
• Averages for each session and each interaction technique were calculated. 
• Averages were visually inspected for conformity to a general trend. 
• If such a trend was identifiable, the normality of the data was investigated. Shapiro-Wilks was used as
the preferred normality test, with an additional test being conducted with Kolmogorov-Smirnov if
necessary. Since time measurements are rarely normally distributed, and this can often be remedied by
converting the measurements to 1/time, this standard practice was employed in an attempt to normalise
the data if the original time measurements were not normal. Where time was converted to
1/time, the normality of 1/time was tested again.
• Sphericity of the data was confirmed using Mauchly’s sphericity test.
• A within-subject repeated-measures ANOVA was conducted on the data to determine if the
interaction technique significantly influenced the time taken to complete the trials.
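The effect of the 1/time conversion mentioned in the procedure above can be illustrated with a minimal sketch. The completion times below are hypothetical and are not taken from the study, and sample skewness is used here only as a rough stand-in for the formal normality tests that were actually applied.

```python
from statistics import fmean


def skewness(values):
    """Return the sample skewness m3 / m2**1.5 (0 for symmetric data)."""
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    m3 = fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Hypothetical completion times (illustrative only, not study data):
# most trials are fast, a few are very slow, giving a long right tail.
times = [1.0, 1.1, 1.2, 1.3, 1.5, 2.0, 4.0]
reciprocals = [1.0 / t for t in times]

skew_time = skewness(times)
skew_reciprocal = skewness(reciprocals)

print(f"skewness of time:   {skew_time:.2f}")
print(f"skewness of 1/time: {skew_reciprocal:.2f}")
```

For these illustrative values the reciprocal transform pulls in the long right tail, so the skewness of 1/time is smaller in magnitude than that of the raw times. This does not guarantee normality, which is why normality was tested again after the conversion.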
This procedure was followed for M(F) and M(I), ETS(F) and ETS(I) as well as for ETSG(F) and ETSG(I). For the 
sake of brevity, only the final results of the ANOVA will be reported here. The following hypotheses were 
formulated: 
1. H0,1: There is no difference between the time required to complete the trials when using M(F) or M(I). 
2. H0,2: There is no difference between the time required to complete the trials when using ETS(F) or 
ETS(I). 
3. H0,3: There is no difference between the time required to complete the trials when using ETSG(F) or 
ETSG(I). 
The results of the ANOVA show that H0,1 could not be rejected; there was therefore no significant difference between the
time required for the trials when using M(F) and M(I) (F(1, 24) = 2.530, p > 0.05). Since the null hypothesis H0,1
could not be rejected, the visual feedback did not affect user performance in terms of time. The interaction
techniques of M(F) and M(I) can therefore be combined into a single interaction technique of M for the 
analysis of the time taken to complete the trials. 
H0,2 could also not be rejected at an α-level of 0.05 (F(1, 28) = 0.002, p > 0.05) which means the time required 
for trial completion with ETS(F) did not differ significantly from that for ETS(I). 
Similarly, H0,3 could not be rejected at an α-level of 0.05 (F(1, 28) = 0.141, p > 0.05), which implies that ETSG(F) 
and ETSG(I) did not require significantly different times to complete the trials.  
As a consequence of not rejecting H0,2, the interaction techniques of ETS(F) and ETS(I) can be combined into a 
single interaction technique for subsequent time analysis. This interaction technique will be referred to as ETS. 
ETSG(F) and ETSG(I) can likewise be combined into a single interaction technique, called ETSG for further time 
analysis. 
The conclusion that can be drawn from this analysis is that the visual feedback does not affect the time in
which users can complete point-and-click tasks when using eye gaze and speech recognition as an interaction
technique. This holds whether or not a gravitational well is present.
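As an illustration of the kind of test used to compare two variants of an interaction technique, the following is a minimal sketch of a one-factor repeated-measures ANOVA. It is a simplification under stated assumptions: the data are hypothetical, only a single within-subjects factor is modelled, and the function name is illustrative. It is not intended to reproduce the degrees of freedom or F-values reported above, which stem from the study's own design and statistical software.

```python
def repeated_measures_anova(scores):
    """One-factor repeated-measures ANOVA.

    scores[i][j] is the measurement of participant i under condition j.
    Returns (F, df_conditions, df_error).
    """
    n = len(scores)       # number of participants
    k = len(scores[0])    # number of conditions
    grand_mean = sum(sum(row) for row in scores) / (n * k)

    condition_means = [sum(row[j] for row in scores) / n for j in range(k)]
    participant_means = [sum(row) / k for row in scores]

    # Partition the total variation into condition, participant and error parts.
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_conditions = n * sum((m - grand_mean) ** 2 for m in condition_means)
    ss_participants = k * sum((m - grand_mean) ** 2 for m in participant_means)
    ss_error = ss_total - ss_conditions - ss_participants

    df_conditions = k - 1
    df_error = (k - 1) * (n - 1)
    f_value = (ss_conditions / df_conditions) / (ss_error / df_error)
    return f_value, df_conditions, df_error


# Hypothetical measurements for three participants under two conditions
# (for example, with and without visual feedback); not study data.
scores = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
f_value, df1, df2 = repeated_measures_anova(scores)
print(f"F({df1}, {df2}) = {f_value:.3f}")
```

The resulting F-value would then be compared against the critical value of the F-distribution at the chosen α-level to decide whether the null hypothesis can be rejected.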
 
5.7.2 Analysing Time 
The average times for the trial completion for all three sessions are tabulated below, with the graphical 
representation in Chart 5.3 below. 
Table 5.9: Average times for consolidated interaction techniques 
 Session 1 Session 2 Session 3 
M 1.254 1.187 1.155 
ETS 9.071 7.556 5.527 
ETSG 1.919 1.745 1.576 
ETSM 10.226 8.867 6.900 
 
 
Chart 5.3: Average times for consolidated interaction techniques 
From Table 5.9 and Chart 5.3 it can clearly be seen that the mouse and ETSG have the most rapid completion 
times for the trials. Furthermore, ETSG performs on a level which appears to be comparable to that of the 
mouse. Although ETS and ETSM have a much longer completion time, there is noticeable improvement 
between the times achieved over the sessions. 
The session data was tested for normality and it was found that none of the sessions were normally 
distributed. The times were therefore converted to 1/time and normality was tested again. The results of the 
Shapiro-Wilks tests for both time and 1/time are tabulated below: 
 Table 5.10: Results of normality tests on time for consolidated interaction techniques 
 Session 1 Session 2 Session 3 
Time W = 0.786, p < 0.05 W = 0.822, p < 0.05 W = 0.834, p < 0.05 
1 / Time W = 0.884, p < 0.05 W = 0.894, p < 0.05 W = 0.890, p < 0.05 
  
Neither time nor 1/time was normally distributed for any of the sessions. Since the ANOVA is robust to 
violations of normality (Section 3.5) and 1/time measurements are generally “more normalised” than time, the 
analysis was conducted on the 1/time measurements. Mauchly’s sphericity test indicated that the
assumption of sphericity was met (χ²(2) = 0.717, p > 0.05). The following hypotheses were formulated for the
analysis of the time: 
1. H0,1: The interaction technique has no effect on the trial times. 
2. H0,2: The trial times did not differ between the sessions. 
H0,1 could be rejected since the resultant p-value was less than the significance level (F(3, 53) = 305.767, p < 
0.05). The subsequent conclusion is that the interaction technique significantly affects the time required to 
complete the trials. Similarly, H0,2 could be rejected at a significance level of 0.05 (F(2, 106) = 24.128, p < 0.05), 
with the conclusion that the trial session significantly affected the time required to complete the trial. 
Post-hoc tests were conducted in order to determine which of the sessions and interaction techniques 
contributed to the significant difference. Tukey’s HSD test indicates that the first session differs significantly 
from both session 2 and session 3. The second session also differs significantly from the third session. Since the
session averages decreased from one session to the next, it can be concluded that there is an element of learning over time.
Therefore, the longer users are exposed to an interaction technique, the better they learn to use it
and the faster they are able to complete a point-and-click trial.
The times achieved with the mouse differed significantly from all other techniques. The averages indicate that 
the mouse has lower times, which means the mouse is notably faster than the other interaction techniques. 
ETSG also differs significantly from the other interaction techniques and closer inspection of the times 
achieved indicates that the gravitational well significantly decreases the time required to complete the trials 
when using eye gaze and speech. ETS and ETSM do not differ significantly from each other. Accordingly, while 
the presence of a gravitational well significantly enhances the performance of ETS, the magnification of targets 
does not.  
According to Chart 5.3, the average times of the mouse remain fairly consistent, while those of ETSG 
systematically improve over time. ETSG also comes very close to the average times of the mouse. Although 
analysis shows that the difference between these two interaction techniques is still significant it might be 
worthwhile to extend the number of trials in order to investigate whether ETSG can ever achieve times which 
are comparable to the mouse. It would be expected that the times of the interaction technique, similar to the
mouse, would eventually reach a fairly constant performance time. Whether this performance time will be
higher than, the same as or lower than that of the mouse will have to be determined. Furthermore, ETS experienced quite a
rapid drop in time from session 2 to session 3 and it would be interesting to determine whether this rate of 
performance improvement will continue over an extended period. 
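The rate of improvement discussed above can be made concrete with a short calculation on the averages in Table 5.9. This is a minimal sketch; the variable and function names are illustrative and only the tabulated means are used.

```python
# Average trial completion times per session, copied from Table 5.9.
mean_times = {
    "M": [1.254, 1.187, 1.155],
    "ETS": [9.071, 7.556, 5.527],
    "ETSG": [1.919, 1.745, 1.576],
    "ETSM": [10.226, 8.867, 6.900],
}


def relative_drops(values):
    """Percentage decrease from each session to the next."""
    return [100.0 * (a - b) / a for a, b in zip(values, values[1:])]


drops = {name: relative_drops(values) for name, values in mean_times.items()}

for name, (first, second) in drops.items():
    print(f"{name:5s} session 1 to 2: {first:5.1f}%   session 2 to 3: {second:5.1f}%")
```

On these averages the mouse improves by only a few percent between sessions, whereas ETS and ETSM show their largest relative drop from session 2 to session 3, which is consistent with the observations made from Chart 5.3.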
 
5.8 Analysis of other measurements 
Apart from the throughput and time to complete the trials, a number of other measurements can also be used 
to compare the effectiveness of eye gaze and speech as a pointing device to that of the mouse. These 
measurements were identified and discussed in Section 3.4.2.1. The measurements deemed appropriate to 
this study were target re-entries, incorrect target acquisitions, incorrect clicks and time to selection. 
 
5.8.1 Target re-entries 
Target re-entries are defined as the number of times the designated target was acquired and then lost before 
the user was able to click on it. 
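The definition above can be sketched as a simple count over a chronological record of whether the pointer was on the designated target. The sample format and the function name are assumptions made for illustration; the logging format of the actual test application is not described here.

```python
def count_target_reentries(on_target_samples):
    """Count how often the designated target was acquired and then lost.

    on_target_samples is a chronological list of booleans, one per pointer
    sample recorded before the click: True while the pointer is on the target.
    """
    losses = 0
    previous = False
    for on_target in on_target_samples:
        if previous and not on_target:
            losses += 1  # the target had been acquired and has now been lost
        previous = on_target
    return losses


# Hypothetical trial: the target is acquired, lost, re-acquired, lost again
# and finally re-acquired just before the click.
samples = [False, True, False, True, True, False, True]
reentries = count_target_reentries(samples)
print(reentries)
```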
 
5.8.1.1 Combining the interaction techniques 
Following the same procedure as in the previous analyses (without conversion to 1/measurement), similar 
interaction techniques were isolated and analysed in order to determine if they could be combined. The 
following null hypotheses were formulated: 
1. H0,1: There is no difference between the number of target re-entries for M(F) and M(I). 
2. H0,2: There is no difference between the number of target re-entries for ETS(F) and ETS(I).  
3. H0,3: There is no difference between the number of target re-entries for ETSG(F) and ETSG(I). 
It was found that H0,1 could not be rejected at an α-level of 0.05 (F(1, 25) = 0.038, p > 0.05). Therefore, the
number of target re-entries did not differ significantly between M(F) and M(I) and these two interaction 
techniques can be regarded as a single technique in subsequent analyses. H0,2 could also not be rejected at a 
significance level of 0.05 (F(1, 28) = 0.001, p > 0.05), therefore there is no notable difference between the 
number of target re-entries for ETS(I) and ETS(F) and they can be combined into ETS for further analysis. 
Similarly, H0,3 could not be rejected as the p-value was larger than the α-level of 0.05 (F(1, 28) = 0.222, p > 
0.05). This result allowed ETSG(F) and ETSG(I) to be consolidated into a single interaction technique of ETSG. 
The total number of target re-entries per session and per interaction technique (after consolidation) is shown
in Table 5.11, together with a number of other descriptive statistics.
Table 5.11 clearly shows that ETSM had a much higher average number of target re-entries than any of the other
interaction techniques. ETS had the next highest average, followed by ETSG, and the mouse had the fewest
target re-entries. Whether these differences are significant will be determined in the following section.
 
5.8.1.2 Analysis of target re-entries 
The final analysis of the target re-entries therefore had the four interaction techniques of M, ETS, ETSG and 
ETSM. The average number of target re-entries per session and per interaction technique is summarised in 
Table 5.12. 
  
  
 Table 5.11: Descriptive statistics for the number of target re-entries
                 Session 1   Session 2   Session 3
M      Total     111         77          87
       Mean      7.4         5.1         5.8
       Min       3           0           0
       Max       18          10          12
       Std Dev   3.9         3.6         2.9
ETS    Total     1577        1063        646
       Mean      105.1       70.9        43.1
       Min       22          10          3
       Max       364         347         99
       Std Dev   116.9       87.2        28.0
ETSG   Total     294         184         133
       Mean      19.6        12.3        8.9
       Min       6           0           0
       Max       81          37          23
       Std Dev   19.0        9.3         5.5
ETSM   Total     2873        2635        2101
       Mean      191.5       175.7       140.1
       Min       59          96          60
       Max       702         306         347
       Std Dev   166.9       66.0        70.9
 
 
Table 5.12: Average target re-entries for consolidated interaction techniques 
 Session 1 Session 2 Session 3 
M 7.4 5.1 5.8 
ETS 110.4 75.2 45.3 
ETSG 19.6 12.3 8.9 
ETSM 191.5 175.7 140.1 
 
Chart 5.4 shows the plot of the interaction techniques against the session. It can clearly be seen that ETSM has 
a much higher number of target re-entries than the remainder of the interaction techniques. 
For the sake of conciseness, the results of the normality tests will henceforth not be reported even though the 
tests were conducted in all instances. 
A repeated-measures ANOVA was used to test the following hypotheses: 
1. H0,1: The interaction technique has no effect on the number of target re-entries. 
2. H0,2: The session has no effect on the number of target re-entries. 
 
Chart 5.4: Average target re-entries for consolidated interaction techniques 
 
The data failed to meet the assumption of sphericity (χ²(2) = 16.136, p < 0.05); therefore, the adjusted
Geisser-Greenhouse and Huynh-Feldt corrections will also be reported for the within-effects variable. These results, as
well as the multivariate results, are tabulated below:
Table 5.13: Complete repeated-measures analysis results for consolidated interaction techniques
                                  ANOVA                         Geisser-Greenhouse               Huynh-Feldt                      Multivariate
Interaction technique             F(3, 56) = 32.071, p < 0.05
Session                           F(2, 112) = 4.249, p < 0.05   F(1.6, 89.3) = 4.249, p < 0.05   F(1.7, 96.4) = 4.249, p < 0.05   F(2, 55) = 3.783, p < 0.05
Interaction technique × Session   F(6, 112) = 1.003, p > 0.05   F(4.7, 89.3) = 1.003, p > 0.05   F(5.2, 96.4) = 1.003, p > 0.05   F(6, 110) = 0.966, p > 0.05
 
From Table 5.13 it can be concluded that H0,1 could be rejected at an α-level of 0.05. Therefore, the interaction 
technique plays a significant role in the number of target re-entries. H0,2 could also be rejected at a significance 
level of 0.05, therefore the session did significantly affect the number of target re-entries. 
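The corrected degrees of freedom reported for the session effect in Table 5.13 follow from multiplying the uncorrected degrees of freedom by a correction factor ε. The sketch below recovers the implied ε from the tabulated values and applies it again. Because the table reports rounded figures, the recovered ε values are approximations and are not the values produced by the statistical software.

```python
# Degrees of freedom for the session effect, as reported in Table 5.13.
uncorrected = (2, 112)
geisser_greenhouse = (1.6, 89.3)
huynh_feldt = (1.7, 96.4)


def implied_epsilon(corrected, original):
    """Recover epsilon from the (less heavily rounded) error degrees of freedom."""
    return corrected[1] / original[1]


def apply_epsilon(epsilon, original):
    """Scale both degrees of freedom by epsilon, rounded to one decimal place."""
    return tuple(round(epsilon * df, 1) for df in original)


eps_gg = implied_epsilon(geisser_greenhouse, uncorrected)
eps_hf = implied_epsilon(huynh_feldt, uncorrected)

print(f"Geisser-Greenhouse epsilon is approximately {eps_gg:.3f}")
print(f"Huynh-Feldt epsilon is approximately {eps_hf:.3f}")
print(apply_epsilon(eps_gg, uncorrected))
print(apply_epsilon(eps_hf, uncorrected))
```

The F-value itself is unchanged by the correction; only the degrees of freedom against which it is evaluated are reduced, which makes the test more conservative.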
Tukey’s HSD was used for post-hoc analysis on the interaction technique. Results indicated that ETSM differed 
significantly from all other interaction techniques. Since ETSM has the highest number of target re-entries, on 
average, the use of ETSM can be said to result in significantly more target re-entries. This would imply that it is 
much harder to achieve a prolonged stable gaze on a button, such that the required verbal command can be 
issued, when the magnification tool is activated, than for any other interaction technique. 
ETS also differs significantly from the mouse and ETSG. ETSG does not differ significantly from the mouse, 
which means that ETSG is able to perform comparably with the mouse in terms of target re-entries. This would 
imply that the positioning of an ETSG interaction technique is just as stable as for a mouse. The higher 
occurrence of re-entries for ETS indicates that the focus slips off the target fairly easily, although there is much 
improvement over the sessions.  
  
5.8.2 Incorrect target acquisitions 
Incorrect target acquisitions are defined as the number of times a target, which is not the designated target, is 
acquired. This means that in the event of the eye-tracker and speech being used, each time a button receives 
enough focus to give visual feedback, the incorrect target acquisitions are incremented, provided that the 
focused button is not the designated target. Only those targets which are acquired after the designated target
has first been acquired are counted as incorrect target acquisitions. Therefore, the incorrect targets that
are acquired cannot be attributed to normal searching for the designated target. For the purposes of this
measurement, only the eye gaze and speech interaction techniques will be included in the analysis as the
number of incorrect target acquisitions for the mouse interaction techniques was always zero.
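The counting rule described above can be sketched as follows. The log format (a chronological list of the buttons that received enough focus to give visual feedback) and the button names are assumptions made for illustration only.

```python
def count_incorrect_acquisitions(acquired_targets, designated):
    """Count non-designated targets acquired after the designated target
    was first acquired.

    acquired_targets is a chronological list of target identifiers.
    Acquisitions made before the designated target is first acquired are
    treated as normal searching and are not counted.
    """
    seen_designated = False
    incorrect = 0
    for target in acquired_targets:
        if target == designated:
            seen_designated = True
        elif seen_designated:
            incorrect += 1
    return incorrect


# Hypothetical trial in which "Save" is the designated target.
log = ["Bold", "Italic", "Save", "Copy", "Save", "Paste", "Save"]
incorrect_count = count_incorrect_acquisitions(log, "Save")
print(incorrect_count)
```

In this hypothetical trial "Bold" and "Italic" are ignored because they precede the first acquisition of "Save", while "Copy" and "Paste" are counted.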
Once again, before the all-inclusive analysis was conducted, similar interaction techniques were analysed in
isolation to determine the viability of combining them into a single interaction technique.
 
5.8.2.1 Combining the interaction techniques 
The same procedure as with the previous measurements was followed to determine whether the similar 
interaction techniques could be combined. 
The following hypotheses were evaluated: 
1. H0,1: There is no difference between the number of incorrect target acquisitions when using ETS(F) or 
ETS(I). 
2. H0,2: There is no difference between the number of incorrect target acquisitions when using ETSG(F) 
or ETSG(I). 
The null hypothesis, H0,1 could not be rejected at an α-level of 0.05 (F(1, 28) = 0.040, p > 0.05), which means 
that ETS(F) and ETS(I) can be combined into a single interaction technique ETS. Furthermore, H0,2 could also 
not be rejected (F(1, 28) = , p > 0.05). Therefore, ETSG(F) and ETSG(I) can be combined into ETSG. 
The final conclusion of this analysis is that the visual feedback does not significantly impact the number of 
incorrect target acquisitions for any of the investigated interaction techniques. Therefore, a complete analysis 
will use a combined ETS(F) and ETS(I) as well as a combined ETSG(F) and ETSG(I). 
Descriptive statistics of the resulting interaction techniques can be seen in Table 5.14. 
 
5.8.2.2 Analysis of incorrect target acquisitions 
The next step was to conduct an ANOVA on all the interaction techniques together, now that some of them 
could be combined in order to simplify the analysis. The averages of the three interaction techniques are 
shown in Table 5.15 with the graphical representation below that (Chart 5.5). 
From these it can be seen that the number of incorrect target acquisitions generally decreases as the
amount of exposure increases. ETS has a higher number of incorrect target acquisitions than the other two
interaction techniques, but it does appear to improve at a faster rate than the other two. These differences
must be statistically analysed to determine whether they are significant.
The assumption of sphericity (χ²(2) = 0.302, p > 0.05) was, however, met. Therefore no corrections are
required on the repeated-measures ANOVA for the following hypothesis:
H0: The interaction technique has no significant impact on the number of incorrect target acquisitions. 
The results of the ANOVA are summarised in Table 5.16. 
Table 5.14: Descriptive statistics for the number of incorrect target acquisitions
                 Session 1   Session 2   Session 3
ETS    Total     734         577         345
       Mean      48.9        38.5        23.0
       Min       21          7           2
       Max       95          80          61
       Std Dev   24.0        23.9        16.9
ETSG   Total     216         178         72
       Mean      14.4        11.9        4.8
       Min       4           0           0
       Max       54          36          12
       Std Dev   12.9        12.0        3.8
ETSM   Total     315         173         187
       Mean      21.0        11.5        12.5
       Min       1           1           0
       Max       84          25          35
       Std Dev   25.2        8.6         11.7
 
 
Table 5.15: Average incorrect target acquisitions for consolidated interaction techniques 
 Session 1 Session 2 Session 3 
ETS 48.9 38.5 23.0 
ETSG 14.4 11.9 4.8 
ETSM 21.0 11.5 12.5 
 
 
 
Chart 5.5: Average incorrect target acquisitions for consolidated interaction techniques 
 
Table 5.16: Results of ANOVA on incorrect target acquisitions for consolidated interaction techniques
Factor                            ANOVA
Interaction technique             F(2, 42) = 19.327, p < 0.05
Session                           F(2, 84) = 12.046, p < 0.05
Interaction technique × Session   F(4, 84) = 2.246, p > 0.05
   
The results of the ANOVA confirm that there are significant differences between the interaction techniques 
and the sessions. Tukey’s HSD post-hoc test was used to determine which interaction techniques contributed 
to the significant result in each case. 
All the sessions differed significantly from one another. Since ETSM was the only technique to show an increase, and
only a slight one in session 3, it can be surmised that incorrect target acquisitions decrease significantly over time. ETS in
particular shows a sharp decrease and it may be beneficial to increase the number of sessions to
determine whether it can ever reach the low values of ETSG or ETSM. In terms of the interaction
techniques, ETS differs significantly from both ETSG and ETSM. ETSG and ETSM do not differ significantly from
each other.
Observations made of the participants while they were completing the tasks could provide an explanation for 
this. Many participants soon realised that when struggling to focus on a button it was sometimes easier to 
focus on another button at a suitable distance from the designated one. It was not necessary to focus on this 
other button for a protracted time. Participants would then look back at the designated button and the 
extended movement seemed to provide more accuracy in focusing on the desired target rather than trying to 
“fine-tune” the selection within a small area around the designated button. The smoothing algorithm could 
have contributed to this as small movements within a certain radius are interpreted as a single fixation. Since 
the gravitational well effectively pulls the selection onto the nearest target once the “pointer” is within a 
certain distance, it becomes easier to focus on a target and no fine-tuning is required. This could explain
why ETSG has such a low number of incorrect target acquisitions compared to ETS.
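The snapping behaviour of a gravitational well, as described above, can be sketched as follows. This is a hypothetical illustration only: the function name, the use of button centres, the Euclidean distance measure and the well radius are all assumptions and do not describe the implementation used in the study.

```python
import math


def snap_to_target(gaze, targets, well_radius):
    """Return the name of the nearest target if the gaze point lies within
    well_radius of its centre, otherwise None.

    gaze is an (x, y) point; targets maps a name to its (x, y) centre.
    """
    nearest_name, nearest_distance = None, math.inf
    for name, (x, y) in targets.items():
        distance = math.hypot(gaze[0] - x, gaze[1] - y)
        if distance < nearest_distance:
            nearest_name, nearest_distance = name, distance
    return nearest_name if nearest_distance <= well_radius else None


# Hypothetical button centres in pixels.
targets = {"Bold": (100, 50), "Italic": (140, 50)}

near_bold = snap_to_target((112, 58), targets, well_radius=20)
far_away = snap_to_target((120, 120), targets, well_radius=20)
print(near_bold, far_away)
```

With a well of this kind the gaze point does not have to rest precisely on a button for it to be selected, which is why no fine-tuning of the gaze position is needed once the pointer is close enough.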
ETSM also has a lower rate than ETS, and this could possibly be attributed to participants preferring to fine-tune the
selection when using the magnification. Since the buttons appear larger, participants may have perceived the
fine-tuning process to be easier since a larger target could create the impression that it can be easily acquired.
The high incidence of target re-entries coupled with the low number of incorrect target acquisitions may serve
to substantiate the suspicion that fine-tuning was the preferred method for ETSM.
The similar pattern for ETS regarding target re-entries and incorrect target acquisitions also corroborates the claim that
the participants preferred to shift their eye gaze to focus on another button and
then return to the designated button. Closer inspection of the averages for ETS shows that incorrect target
acquisitions constituted approximately half the number of re-entries for each session. This could indicate that
participants would attempt to re-acquire the designated target and, when they were unable to achieve a stable
selection, resorted to focusing on another target before attempting to select the designated target, in
contrast to the strategy employed with ETSM.
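The proportion referred to above can be checked directly from the tabulated ETS averages. The sketch below uses the mean target re-entries from Table 5.11 and the mean incorrect target acquisitions from Table 5.14; the variable names are illustrative.

```python
# ETS session means: target re-entries (Table 5.11) and
# incorrect target acquisitions (Table 5.14).
ets_reentries = [105.1, 70.9, 43.1]
ets_incorrect_acquisitions = [48.9, 38.5, 23.0]

ratios = [
    incorrect / reentries
    for incorrect, reentries in zip(ets_incorrect_acquisitions, ets_reentries)
]

for session, ratio in enumerate(ratios, start=1):
    print(f"Session {session}: {ratio:.2f}")
```

For all three sessions the ratio lies close to 0.5, in line with the statement that incorrect target acquisitions amounted to approximately half the number of re-entries.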
The reason for this could be that the magnification disturbs the users while they adjust their gaze and they are 
unwilling to move their gaze substantially because they perceive this to require more effort when 
magnification is activated. Another reason for the different strategies could be attributed to the fact that the 
magnification tool that was used has in-built visual feedback which allows the user to get an approximation of 
their eye gaze position, which is centred in the magnified area. Since this feedback is present, the user may 
feel that fine-tuning is a better option since they can determine how close they are to the target, which is not 
the case with ETS. With ETS they will know they have lost the target but not how close they are to re-acquiring 
it, hence they feel more secure glancing at another target, establishing position and then looking at the 
required target again until they can maintain a stable eye gaze. Therefore, to slave a cursor to the eye gaze 
may be disruptive but in this instance it could tentatively be said that it may have provided useful information 
to the users. However, the evidence suggests that it in no way increased the efficiency or effectiveness of 
target selection and therefore it is not recommended for use. 
The average number of target re-entries for ETSG was roughly the same as the average number of incorrect target
acquisitions for ETSG. This could provide evidence that when using ETSG, the target was easier to acquire and
to keep in focus long enough to issue the required command. Since the buttons were effectively larger it would
make sense that they were easier to focus on for a prolonged period of time.
 
5.8.3 Incorrect clicks 
Incorrect clicks are defined as the number of times a target that was not the designated target was clicked
during a trial.
 
5.8.3.1 Combining the interaction techniques 
Following the same procedure as the preceding sections, similar interaction techniques were first inspected on 
their own to determine whether they could be combined for further analysis. 
1. H0,1: The number of incorrect clicks is not significantly different between M(F) and M(I). 
2. H0,2: The number of incorrect clicks is not significantly different between ETS(F) and ETS(I). 
3. H0,3: The number of incorrect clicks is not significantly different between ETSG(F) and ETSG(I). 
H0,1 could not be rejected (F(1, 25) = 0.706, p > 0.05). Therefore, the number of incorrect clicks is not
significantly affected by the type of feedback given with a mouse. Consequently, M(F) and M(I) will no longer
be distinguished and a single interaction technique of M will be used.
Similarly, the null hypothesis for ETS(F) and ETS(I) could not be rejected at a significance level of 0.05 (F(1, 28) 
= 0.998, p > 0.05). Therefore, these two interaction techniques could be combined into ETS for the number of 
incorrect clicks.  
At a significance level of 0.05, H0,3 could not be rejected (F(1, 28) = 0.183, p > 0.05). Therefore, the
number of incorrect clicks for ETSG(F) and ETSG(I) does not differ significantly and they can be
combined into a single technique, called ETSG.
As a result of these findings, there were now only four interaction techniques. Descriptive statistics are given 
for these interaction techniques in Table 5.17. 
 
5.8.3.2 Analysis of incorrect clicks 
The final analysis of the number of incorrect clicks included the four interaction techniques M, ETS, ETSG and 
ETSM, the averages of which are tabulated in Table 5.18. 
 
  
 Table 5.17: Descriptive statistics for the number of incorrect clicks 
  Session 1 Session 2 Session 3 
M Total 2 3 2 
 Mean 0.1 0.2 0.1 
 Min 0 0 0 
 Max 1 1 1 
 Std Dev 0.4 0.4 0.4 
ETS Total 21 24 15 
 Mean 1.4 1.6 1.0
 Min 0 0 0 
 Max 4 7 2 
 Std Dev 1.2 1.8 0.8 
ETSG Total 63 40 23 
 Mean 4.2 2.7 1.5 
 Min 0 0 0 
 Max 9 6 5 
 Std Dev 2.7 2.0 1.4 
ETSM Total 10 17 13 
 Mean 0.7 1.1 0.9 
 Min 0 0 0 
 Max 2 4 4 
 Std Dev 0.8 1.3 1.2 
 
 
Table 5.18: Average number of incorrect clicks for consolidated interaction techniques 
 Session 1 Session 2 Session 3 
M 0.1 0.2 0.1 
ETS 1.4 1.6 1.0 
ETSG 4.2 2.7 1.5 
ETSM 0.7 1.1 0.9 
 
Chart 5.6 plots the averages of the interaction techniques against the sessions.  
Somewhat surprisingly, it is ETSG that has the highest average number of incorrect clicks of all the interaction
techniques. Since ETSG had the lowest number of incorrect target acquisitions, this observation
might be considered unexpected.
The following hypothesis was formulated to determine whether the interaction technique affected the 
number of incorrect clicks: 
H0: The number of incorrect clicks is not significantly influenced by the interaction technique. 
The assumption of sphericity (χ²(2) = 4.157, p > 0.05) was met. The within-subjects, repeated-measures ANOVA
indicated that there was significant interaction between the two factors (F(6, 112) = 4.689, p < 0.05).
Therefore, the factors had to be examined in isolation so as to control for the other factor. Since the factor of 
interest is the interaction technique, the session was controlled for and three separate ANOVAs were run in 
order to test for significant differences. The results of these ANOVAs are summarised in Table 5.19. 
  
 
Chart 5.6: Average number of incorrect clicks for consolidated interaction techniques 
 
Table 5.19: Results of separate ANOVA on incorrect clicks for consolidated interaction techniques
         Session 1                    Session 2                   Session 3
ANOVA    F(1, 3) = 20.362, p < 0.05   F(1, 3) = 6.953, p < 0.05   F(1, 3) = 4.890, p < 0.05
  
As can be seen from the results in the table, the interaction techniques resulted in significantly different
numbers of incorrect clicks for all three sessions. Post-hoc analysis was conducted next so that the significant
differences could be accounted for.
During session 1, ETSG differed significantly from all other techniques. Since the number of incorrect clicks was 
highest for ETSG, it can be concluded that ETSG caused significantly more incorrect clicks during the first 
session. During the second session, ETSG differed significantly from the mouse and ETSM and in the third 
session only from the mouse. These results clearly show that ETSG results in the highest number of incorrect 
clicks. Although continued practice allowed ETSG to have a comparable number of incorrect clicks to ETS and 
ETSM, its performance could not match that of the mouse over the three sessions. This indicates that some 
learning did take place over the three sessions. 
Natural eye movement may provide an explanation for the observed difference. Participants could acquire the 
target and then issue a verbal command while already starting to look at the next target (for all eye gaze and 
speech interaction techniques). Since the use of the gravitational well increases the speed with which a target 
can be acquired, this often meant that by the time the speech engine recognised the command, the next 
target had already been acquired. This could account for the high number of incorrect clicks for ETSG. These 
findings also confirm previous findings that the fixation immediately prior to the action or command being 
issued usually occurs on the object of interest (Land & Tatler, 2009; Maglio et al., 2000). 
The fact that ETSG had significantly fewer incorrect target acquisitions, coupled with this finding of more 
incorrect clicks, creates the following dilemma. The use of the gravitational well increases the possibility of correctly 
acquiring a target and maintaining a stable gaze on the target. This is evidenced by the fact that other eye gaze 
and speech interaction techniques caused participants to first glance away, acquire another target and then 
glance back. However, the presence of a gravitational well, together with the human tendency to start 
glancing at the next object of interest whilst still issuing a command to the current target, means that the next 
target is acquired far quicker than when no gravitational well is present. This causes the next target to be 
incorrectly clicked on with higher frequency for ETSG. Since participants started moving their eye gaze away 
from the buttons before the speech command had been executed for all eye gaze interaction techniques, it 
would be assumed that for ETS and ETSM, which pose greater difficulty in target acquisition, the participant 
would inadvertently have caused a click somewhere on the application form which was not a clickable area. 
This would correspond to the measurement of number of missed clicks as discussed in Section 3.4.2.1. 
Unfortunately, this measurement was not captured during these tests. Further research must be done in order 
to determine if this proposition is true. 
 
5.8.4 Time to selection 
Time to selection is measured as the time between when the final target acquisition is performed and when 
the target is actually clicked or selected. The final target acquisition is defined as the last time the button 
receives focus before being clicked. The same procedure as with the other measurements was followed. All 
data was measured in milliseconds and then converted to 1/time to optimise the possibility of normalisation.  
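
The measurement defined above can be sketched as follows. The event log format, the event names `"focus"` and `"click"` and the function names are assumptions made for illustration; the logging format of the test application is not described in this section.

```python
def time_to_selection(events):
    """events: list of (timestamp_ms, kind) for one target, in time order,
    where kind is "focus" (target acquired) or "click" (target selected).
    Returns the milliseconds between the last focus before the click
    and the click itself."""
    click_time = next(t for t, kind in events if kind == "click")
    final_acquisition = max(
        t for t, kind in events if kind == "focus" and t <= click_time
    )
    return click_time - final_acquisition


def reciprocal(time_ms):
    """The 1/time transformation applied before the analysis."""
    return 1.0 / time_ms


# Hypothetical log: the target is acquired, lost, re-acquired and then clicked.
log = [(0, "focus"), (400, "focus"), (1500, "click")]
selection_ms = time_to_selection(log)   # 1500 - 400 = 1100
transformed = reciprocal(selection_ms)
```

Only the final acquisition at 400 ms counts; the earlier acquisition at 0 ms is ignored because the target lost focus in between.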
 
5.8.4.1 Consolidating the interaction techniques 
The following hypotheses were formulated: 
1. H0,1: There is no difference between the time to selection when using M(F) and M(I). 
2. H0,2: There is no difference between the time to selection when using ETS(F) and ETS(I). 
3. H0,3: There is no difference between the time to selection when using ETSG(F) and ETSG(I). 
H0,1 could not be rejected at an α-level of 0.05 (F(1, 25) = 0.473, p > 0.05); therefore M(F) and M(I) can be 
considered as a single interaction technique for the purposes of the time to selection analysis. Similarly, 
neither H0,2 (F(1, 28) = 0.647, p > 0.05) nor H0,3 (F(1, 28) = 0.406, p > 0.05) could be rejected, with the result that ETS(F) 
and ETS(I) can be combined into ETS and ETSG(F) and ETSG(I) into ETSG. 
Table 5.20 tabulates the descriptive statistics for the four resulting interaction techniques. 
  
 Table 5.20: Descriptive statistics for time to selection 
  Session 1 Session 2 Session 3 
M Mean 321.8 324.7 321.3 
 Std Dev 48.3 93.1 77.6 
ETS Mean 1154.5 1187.5 1136.6 
 Std Dev 216.1 151.4 107.9 
ETSG Mean 1097.7 1104.5 1011.4 
 Std Dev 134.5 99.3 110.8 
ETSM Mean 1093.8 1123.4 1036.2 
 Std Dev 213.4 129.0 125.2 
 
5.8.4.2 Analysis of time to selection 
Table 5.21 provides a summary of just the averages of the time to selection, with Chart 5.7 below that giving a 
graphical depiction of the averages. 
 Table 5.21: Average time to selection 
  Session 1 Session 2 Session 3 
M Mean 321.8 324.7 321.3 
ETS Mean 1154.5 1187.5 1136.6 
ETSG Mean 1097.7 1104.5 1011.4 
ETSM Mean 1093.8 1123.4 1036.2 
 
 
Chart 5.7: Average time to selection 
As can be expected, the interaction techniques of ETS, ETSG and ETSM all have similar times to selection. Since 
they all have the same selection device, i.e. speech, it seems appropriate that they maintain similar averages. 
On average, it took participants approximately 300 milliseconds to select a target with the mouse once it had 
been acquired. The eye gaze and speech techniques averaged a selection time of over 1 second. 
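
The approximate figures quoted above can be reproduced from Table 5.21 with the short sketch below. It takes the unweighted mean of the three session means, which is an assumption: if the number of observations differed between sessions, the true overall mean would differ slightly.

```python
# Mean time to selection (ms) per session, copied from Table 5.21.
table_5_21 = {
    "M":    [321.8, 324.7, 321.3],
    "ETS":  [1154.5, 1187.5, 1136.6],
    "ETSG": [1097.7, 1104.5, 1011.4],
    "ETSM": [1093.8, 1123.4, 1036.2],
}

# Unweighted average of the three session means per technique.
overall = {name: sum(v) / len(v) for name, v in table_5_21.items()}

for name, mean_ms in overall.items():
    print(f"{name}: {mean_ms:.1f} ms")
```

The mouse averages roughly 323 ms, while each of the three eye gaze and speech techniques averages more than 1000 ms.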
The following hypothesis was inspected: 
H0: The interaction technique has no effect on the time to select an acquired target. 
The assumption of sphericity (χ²(2) = 8.665, p < 0.05) was also violated, therefore the adjusted corrections will 
be reported where applicable. 
Table 5.22: ANOVA results of time to selection 
                                  ANOVA                          Geisser-Greenhouse               Huynh-Feldt                       Multivariate 
Interaction technique             F(3, 54) = 196.605, p < 0.05 
Session                           F(2, 108) = 0.543, p > 0.05    F(1.7, 93.9) = 0.543, p > 0.05   F(1.9, 102.1) = 0.543, p > 0.05   F(2, 53) = 0.869, p > 0.05 
Interaction technique × Session   F(6, 108) = 0.909, p > 0.05    F(5.2, 93.8) = 0.909, p > 0.05   F(5.7, 96.4) = 102.1, p > 0.05    F(6, 106) = 0.586, p > 0.05 
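
The corrected degrees of freedom in Table 5.22 arise because a sphericity correction multiplies both degrees of freedom by a factor epsilon while leaving the F value unchanged. As a minimal sketch, epsilon can be recovered from the Session row of the table; the function name `epsilon_from_df` is hypothetical, and the recovered values are approximate because the table reports the degrees of freedom to one decimal place.

```python
def epsilon_from_df(uncorrected_df, corrected_df):
    """A sphericity correction scales the degrees of freedom by epsilon,
    so epsilon can be recovered as corrected / uncorrected."""
    return corrected_df / uncorrected_df


# Error degrees of freedom for the Session effect, read from Table 5.22.
gg_epsilon = epsilon_from_df(108, 93.9)    # Geisser-Greenhouse
hf_epsilon = epsilon_from_df(108, 102.1)   # Huynh-Feldt

print(round(gg_epsilon, 2), round(hf_epsilon, 2))
```

Applying the same factors to the numerator degrees of freedom of 2 gives approximately 1.7 and 1.9, in agreement with the Session row.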
 
From the results in Table 5.22, H0 could be rejected, meaning that the interaction technique does have an 
effect on the selection time. Post-hoc tests indicate that the mouse differs significantly from all other 
techniques. The selection time for the mouse is, on average, lower than those for the other interaction 
techniques; therefore selection with the mouse is significantly faster than selection with the other 
interaction techniques. This result has serious implications for the acceptance of eye gaze and speech as an 
interaction technique since it shows that even if the final acquisition can be performed in a comparable time 
to the mouse, the selection itself will still take significantly longer. There is no noticeable 
improvement in the time to select over the sessions, which is not surprising since this factor hinges on the 
issuing of a verbal command. It is highly improbable that a participant can improve the speed at which they 
utter a command in reaction to a selection. 
 
5.8.4.3 Further analysis of selection times 
This discovery led to the question being posed as to whether the final acquisition of the target differed 
significantly between the interaction techniques. Inspection of the overall trial times showed that only ETSG 
averaged in the region of the mouse. Therefore, this analysis was confined to the interaction techniques of the 
mouse and eye gaze and speech with a gravitational well. The time in milliseconds to achieve a final acquisition 
of the designated target was calculated for M(F), M(I), ETSG(F) and ETSG(I). The final acquisition was 
determined as the acquisition immediately prior to selection of the designated target. Analysis showed that 
M(F) and M(I) could be combined into M (F(1, 25) = 3.123, p > 0.05). Similarly, ETSG(F) and ETSG(I) could be 
combined into ETSG (F(1, 28) = 0.174, p > 0.05). 
Descriptive statistics for M and ETSG are tabulated below: 
 Table 5.23: Descriptive statistics for final acquisition times 
  Session 1 Session 2 Session 3 
M Mean 917.4 855.8 826.6 
 Std Dev 104.6 113.5 130.5 
ETSG Mean 817.4 640.0 564.5 
 Std Dev 246.6 204.9 209.6 
 
Chart 5.8 clearly shows that, on average, ETSG has a lower final acquisition time than the mouse. 
The following hypothesis can be formulated: 
H0: The interaction technique has no effect on the time to final target acquisition. 
The data conformed to the assumption of sphericity (χ²(2) = 0.415, p > 0.05). A within-subjects repeated-measures 
ANOVA showed that there was significant interaction between the two factors (F(2, 54) = 4.155, p < 
0.05), therefore separate ANOVAs were conducted where the session was controlled for. These results are 
summarised in Table 5.24. 
According to Table 5.24, there was a significant difference between the interaction techniques for all three 
sessions. ETSG had a lower average final target acquisition in all three sessions; therefore ETSG is significantly 
faster in terms of final target acquisition than the mouse. 
 
 
Chart 5.8: Average time to final selection for M and ETSG 
 
 Table 5.24: Separate ANOVA results for final target acquisition 
Session 1                    Session 2                     Session 3 
F(1, 28) = 4.627, p < 0.05   F(1, 27) = 10.875, p < 0.05   F(1, 28) = 19.511, p < 0.05 
  
For overall time to target selection, the mouse is significantly faster than ETSG. However, when selection time 
is divided into final target acquisition and time to selection, ETSG is found to have a significantly faster 
final target acquisition but a significantly slower time to selection. The time to selection is so much 
slower that the overall time differs significantly. Section 5.7.2 posed the question as to whether ETSG could, 
over time, achieve the same speeds as the mouse. It would now seem that final target acquisition would have 
to improve dramatically to achieve this. Since acquisition times did improve over time, this remains a viable 
possibility for improved overall selection times. An additional option could be to explore another selection 
type, such as using look-and-shoot with the press of a keyboard key. This method could possibly provide a 
faster time to selection than the uttering of a verbal command. 
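
The size of the gap can be made concrete with a short calculation using the session 3 means from Tables 5.23 and 5.20. Adding the two means per technique is an assumption made for illustration; the overall trial times analysed earlier in the chapter were measured directly and need not equal this sum exactly.

```python
# Session 3 means in milliseconds, taken from Tables 5.23 and 5.20.
final_acquisition = {"M": 826.6, "ETSG": 564.5}
time_to_selection = {"M": 321.3, "ETSG": 1011.4}

# Combined time per technique: acquisition followed by selection.
combined = {
    name: final_acquisition[name] + time_to_selection[name]
    for name in final_acquisition
}

# Acquisition time ETSG would need in order to match the mouse
# if its time to selection stayed unchanged.
required_acquisition = combined["M"] - time_to_selection["ETSG"]

print(round(combined["M"], 1), round(combined["ETSG"], 1))
print(round(required_acquisition, 1))
```

Under these assumptions the mouse needs about 1148 ms and ETSG about 1576 ms, so the ETSG acquisition time would have to fall from 564.5 ms to roughly 137 ms to close the gap through acquisition alone.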
 
5.9 Subjective device assessment 
The final measurement which will be evaluated is that of subjective satisfaction. Since the subjective feeling of 
users does not necessarily mirror their actual performance, it is imperative that both be tested. Participants 
each completed the post-test questionnaire (Appendix E) which focused on the use of the eye-tracker and 
speech recognition to select targets. The questionnaire results are tabulated below. Each response was rated 
on a 5-point scale. The responses in Table 5.25 are grouped according to the number of responses for the 
lower range of the scale (1 and 2), the neutral or midpoint of the scale (3) and the higher range of the scale (4 
and 5). 
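
The grouping and the summary statistics reported per question can be sketched as follows. The ratings are hypothetical, and the use of the sample standard deviation is an assumption since the thesis does not state which form was used.

```python
from collections import Counter
from statistics import mean, mode, stdev


def summarise(responses):
    """Group 5-point ratings into low (1-2), neutral (3) and high (4-5)
    and compute the mean, sample standard deviation and mode."""
    groups = Counter(
        "low" if r <= 2 else "neutral" if r == 3 else "high"
        for r in responses
    )
    return {
        "low": groups["low"],
        "neutral": groups["neutral"],
        "high": groups["high"],
        "mean": round(mean(responses), 1),
        "std_dev": round(stdev(responses), 1),
        "mode": mode(responses),
    }


# Hypothetical ratings from 15 participants for a single question.
ratings = [1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5]
summary = summarise(ratings)
```

The three group counts always add up to the number of respondents, which gives a quick consistency check on each row of such a table.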
Nine participants felt that the force required to move the device was neither too low nor too high, but the 
average score indicates that the force required might be slightly high. The majority of the participants felt that 
the movement of the device was a little rough and that the mental effort required was too high. While physical 
effort may be low, accurate pointing is difficult and the operation speed is too fast. Neck fatigue was 
experienced by only a small number of the participants and the device appears to be fairly comfortable to use. 
In summary, it seems as though the use of the eye-tracker and speech recognition is relatively easy.  
Furthermore, ten of the participants felt that with enough practice they could match the mouse speeds when 
using eye gaze and speech. Thirteen indicated that they enjoyed using eye gaze and speech recognition as a 
pointing device. Only five felt that magnification enhanced the use of the mouse, while ten participants felt 
that the magnification made the use of eye gaze more difficult. 
In terms of the preferred appearance of the buttons, 11 participants preferred the large buttons and nine 
preferred the framed visual feedback over the inverted colour visual feedback.  
 
Table 5.25: Results of the device assessment questionnaire 
Question                        Answer group    Number of answers   Mean   Standard deviation   Mode 
Actuation force                 Low             1 
                                Neutral         9                   3.3    0.6                  3.0 
                                High            6 
Smoothness                      Rough           6 
                                Neutral         4                   2.9    1.1                  2.0 
                                Smooth          5 
Mental effort                   Low             2 
                                Neutral         5                   3.5    0.8                  4.0 
                                High            8 
Physical effort                 Low             5 
                                Neutral         6                   2.9    0.8                  3.0 
                                High            4 
Accurate pointing               Easy            3 
                                Neutral         4                   3.6    1.1                  5.0 
                                Difficult       8 
Operation speed was             Fast            7 
                                Neutral         2                   2.9    1.0                  2.0 
                                Slow            6 
Neck fatigue                    None            11 
                                Neutral         3                   1.8    1.0                  1.0 
                                High            1 
General comfort                 Uncomfortable   5 
                                Neutral         4                   3.1    1.2                  2.0 
                                Comfortable     6 
Overall, the input device was   Difficult       4 
                                Neutral         2                   3.3    1.1                  4.0 
                                Easy            9 
 
 
5.10 Summary of findings 
This chapter contained a large amount of analysis; this section summarises the findings. It was 
found that the type of visual feedback did not affect the performance of the interaction techniques with 
regard to any of the measurements that were analysed. Therefore, even though users may prefer a certain 
type of visual feedback, their performance will not be impacted by the choice of visual feedback. 
The mouse has a significantly higher throughput than the other interaction techniques. The use of a 
gravitational well causes a significant improvement to the throughput of eye gaze and speech as an interaction 
technique. Magnification does not positively influence the throughput of eye gaze and speech as an 
interaction technique. 
The mouse is also significantly faster than the other interaction techniques and the use of a gravitational well 
causes a significant decrease in point-and-click time for eye gaze and speech. Furthermore, magnification of 
the targets does not improve the time performance of eye gaze and speech. 
In terms of the other measurements that were analysed, the following discoveries were made: 
1. The use of magnification causes a significant increase in the number of target re-entries made while 
selecting a target. This implies that when magnification is activated, it becomes harder to maintain a 
stable gaze on a target. This is despite the fact that the target is essentially much larger than with any 
other interaction technique. The high number of target re-entries, together with the relatively low 
number of incorrect target acquisitions, suggests a desire by the participants to fine-tune the selection, 
possibly due to the impression that the larger the target is, the easier it is to acquire. 
2. The use of a gravitational well increased the number of incorrect clicks, possibly due to the fact that 
the eye already starts moving to the next target while the speech command is issued. Although the 
number of incorrect clicks became comparable to that of ETS and ETSM with practice, it could not match that of 
the mouse within the three sessions. Target re-entries and incorrect target acquisitions were kept very low with this interaction 
technique. Moreover, the averages between target re-entries and incorrect target acquisitions remain 
approximately the same, providing further evidence of the assertion (when compared to other 
techniques) that maintaining a stable gaze is easier. The very presence of a gravitational well could 
well be a “double-edged sword” in the sense that it becomes easier to maintain a stable gaze but it 
also causes more incorrect clicks since subsequent targets are also easier to acquire – so much so that 
they are acquired before the verbal command is completely issued or processed.  
3. ETS has significantly more incorrect target acquisitions than ETSM or ETSG.  
4. ETS had roughly double the number of target re-entries compared to incorrect target acquisitions. In 
comparison, ETSM had a high number of target re-entries but much lower incorrect target 
acquisitions. This suggests that participants employed different strategies when attempting to select 
with ETS and ETSM. When using ETS, it would appear that participants prefer to look a distance away 
from the designated target, quite often inadvertently or purposefully acquiring another target. They 
will then look back at the designated target in order to attempt a selection. Conversely, with ETSM 
the high number of target re-entries indicates a high incidence of target slippage but the relatively 
low number of incorrect target acquisitions points to a method of fine-tuning for selection purposes.   
5. ETSG has a significantly faster time to final acquisition but a significantly slower time to selection. 
Overall, the negative impact of the time to selection causes ETSG to be significantly slower than the 
mouse. However, this remains a promising discovery and continued practice may possibly improve 
the final acquisition time further, as already evidenced in the three sessions conducted. Another alternative is 
to provide a different means of selection, such as look-and-shoot coupled with the press of a 
keyboard key. 
Although designed to alleviate the strain of finely focusing on small targets, the magnification tool required 
perhaps the most concentration and was unnatural for the majority of the participants. This could perhaps be 
the reason behind its poor performance against the other interaction techniques. The swift reaction of the eye 
gaze when employing the gravitational well could be expected by the participants as people are accustomed to 
rapid focusing. Additionally, the presence of peripheral vision together with the use of a physical interaction 
technique negates the need for prolonged and finely tuned focusing under normal circumstances. The higher 
performance is undoubtedly directly related to the fact that the selectable area is much larger than with the 
other interaction techniques and it facilitates a smoother selection regardless of the stability of the eye gaze. 
Since users are not aware of their fine eye movements, the gravitational well is perhaps the interaction 
technique which most closely resembles the expectations of the user in terms of their perceived behaviour.  
The gravitational well also inspires more confidence in the users as they are unaware of the larger selectable 
area but they are achieving the desired results with minimal effort. It also allows for a more aesthetically 
pleasing interface as the widgets are kept to a smaller size, although they must be spaced further apart to 
make provision for the gravitational well. 
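
The effect of a larger selectable area can be illustrated with a minimal hit-test sketch. It assumes that the gravitational well behaves like an invisible margin around the visible button; the class `Button`, the function `acquired`, the parameter `well_margin` and all coordinates are hypothetical and do not describe the implementation used in this study.

```python
from dataclasses import dataclass


@dataclass
class Button:
    name: str
    x: float        # left edge of the visible button
    y: float        # top edge of the visible button
    width: float
    height: float


def acquired(button, gaze_x, gaze_y, well_margin=0.0):
    """True if the gaze point falls inside the button, enlarged on every
    side by well_margin pixels (0 means no gravitational well)."""
    return (
        button.x - well_margin <= gaze_x <= button.x + button.width + well_margin
        and button.y - well_margin <= gaze_y <= button.y + button.height + well_margin
    )


bold = Button("Bold", x=100, y=100, width=24, height=24)
gaze = (130, 110)   # 6 pixels to the right of the visible button

without_well = acquired(bold, *gaze)                 # False
with_well = acquired(bold, *gaze, well_margin=10)    # True
```

The sketch also shows why the widgets must be spaced further apart: if the enlarged regions of neighbouring buttons overlapped, a single gaze point could fall within more than one of them.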
These findings confirm to an extent previous findings (Ashmore et al., 2005) in the sense that omnipresent 
magnification does not perform as well as other pointing techniques. The GHA fisheye lens used in the study of 
Ashmore et al. (2005) also requires that the user fine-tune the selection of the target within a magnified area. 
However, this still facilitates better pointing than an omnipresent fisheye lens. The reason for this and for the 
performance of ETSM could be the disruption of the visual search caused by the omnipresent magnification. 
The current study’s results also confirm those of Ashmore et al. (2005) that omnipresent magnification and no 
magnification have equivalent selection times for eye gaze. 
Incorrect clicks were experienced with all eye gaze interaction techniques although more so with ETSG. 
Nevertheless, this finding corresponds with the finding of Kaur et al. (2003) that the target which was acquired 
a certain amount of time prior to command execution is the target that must be selected. Although the 
interval was found to be 630 milliseconds (Kaur et al., 2003), this interval will have to be confirmed for use with 
eye gaze and speech. While natural eye gaze movement appears to dictate that the target prior to command 
utterance must be selected, it must still be determined whether this will appear natural to the user or whether 
they would prefer to adapt to the use of ETSG as it was tested in this study. Clearly, practice allows them to 
adjust their natural behaviour to a degree to compensate for the interaction technique as is evidenced by the 
improvement over the sessions. However, requiring users to change their natural behaviour is not the aim of a 
multimodal interface. Therefore, it becomes necessary to establish the interval required for target selection 
and test the usability of that compared to the standard gravitational well employed in this study. 
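
Selecting the target that held the gaze a fixed interval before command execution can be sketched as follows. The default of 630 ms is the interval reported by Kaur et al. (2003); the gaze log format and the function name `target_at` are assumptions made for illustration, and the appropriate interval for eye gaze and speech remains to be established.

```python
def target_at(gaze_log, command_time_ms, lookback_ms=630):
    """gaze_log: list of (timestamp_ms, target_name) in time order, with one
    entry each time a new target is acquired. Returns the target that held
    the gaze lookback_ms before the command was executed, or None."""
    moment = command_time_ms - lookback_ms
    current = None
    for timestamp, target in gaze_log:
        if timestamp > moment:
            break
        current = target
    return current


# Hypothetical log: the gaze leaves "Bold" for "Italic" 300 ms before
# the speech engine executes the command.
log = [(0, "Bold"), (1700, "Italic")]
immediate = target_at(log, command_time_ms=2000, lookback_ms=0)   # "Italic"
delayed = target_at(log, command_time_ms=2000)                    # "Bold"
```

Without the look-back the command lands on the next target, which is the incorrect click pattern observed with ETSG; with the look-back it lands on the target the participant was addressing.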
Previous techniques such as the touch-sensitive mouse and MAGIC pointing warped the mouse pointer to the 
position of the eye gaze and then users were required to use the mouse pointer to click on the desired target. 
Although this exploits the high speed of eye gaze and also reduces incidences of incorrect clicks since users are 
not likely to click on the incorrect target when having to manually manipulate a mouse pointer, some physical 
dexterity is required. The solution may lie in a combination of this technique and speech. Eye gaze could be 
used to establish intent, a single voice command could be issued to warp the pointer to the selectable target 
closest to the current eye gaze and once the user has verified that the correct target is acquired, a second 
command can be issued to click on the target. For fine-tuning purposes of the mouse cursor, direction- or 
target-based navigation can also be provided. 
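
The proposed two-command interaction can be sketched as follows. The command words `"warp"` and `"click"`, the class `Pointer` and the target coordinates are hypothetical; this is an illustration of the idea under those assumptions, not a tested design.

```python
import math


def nearest_target(targets, gaze):
    """targets: dict mapping a target name to its centre (x, y).
    Returns the name of the target closest to the gaze point."""
    return min(targets, key=lambda name: math.dist(targets[name], gaze))


class Pointer:
    def __init__(self, targets):
        self.targets = targets
        self.position = None   # target the pointer currently rests on
        self.clicked = None    # last target that was clicked

    def handle(self, command, gaze):
        if command == "warp":
            # First command: move the pointer to the target nearest the gaze.
            self.position = nearest_target(self.targets, gaze)
        elif command == "click" and self.position is not None:
            # Second command: click where the pointer is, not where the eye is.
            self.clicked = self.position


targets = {"Bold": (100, 100), "Italic": (140, 100)}
pointer = Pointer(targets)
pointer.handle("warp", gaze=(108, 104))
pointer.handle("click", gaze=(300, 300))   # the gaze has already moved on
```

Because the click applies to the pointer position rather than to the current gaze, an early glance towards the next target no longer causes an incorrect click.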
In terms of comparison with previous studies, not many previous studies compared eye gaze selection with the 
ISO test and certainly none on this scale. The closest comparison would be with the look-and-shoot tests since 
eye gaze and speech could be considered look-and-shoot. Throughput for ETSG was much lower (2.31 bps) 
than the look-and-shoot using the space bar (3.78 bps). The accuracy of the speech engine could have played a 
significant role in this instance and it may be worthwhile investigating this supposition using a Wizard of Oz 
study to determine whether it can compete with dwell time and using look-and-shoot with a relatively 
error-free activation mechanism such as a key press. In terms of selection time, ETSG averaged approximately 1000 
ms while acquisition time was approximately 500 ms in the third session of the ISO test. Using the ISO test it 
was suggested that a dwell time of 500 ms (Zhang & MacKenzie, 2007) was the most appropriate. If one 
assumes that target acquisition will be similar then the speech takes double the time of using the dwell time. It 
can therefore be concluded that using speech may be less efficient than using dwell time although studies 
must be conducted to verify this.  
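
The comparison with dwell time amounts to the following arithmetic, using the approximate figures quoted above. It assumes, as the text does, that target acquisition would take the same time under both selection methods.

```python
acquisition_ms = 500         # approximate ETSG acquisition time, session 3
speech_selection_ms = 1000   # approximate ETSG time to selection
dwell_ms = 500               # dwell time suggested by Zhang & MacKenzie (2007)

# Selection alone: speech compared with dwell time.
selection_ratio = speech_selection_ms / dwell_ms

# Acquisition plus selection under each method.
total_speech = acquisition_ms + speech_selection_ms
total_dwell = acquisition_ms + dwell_ms
```

Selection by speech therefore takes about twice as long as a 500 ms dwell, and the complete point-and-select action takes about 1500 ms compared with 1000 ms.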
 
5.11 Further research 
Further research can be conducted for interaction techniques using the ISO pointing device test. Future 
experimental setups will exclude different feedback techniques and rather concentrate on changing the 
distance between targets and the size of targets. In this way, more trials with a single interaction technique 
can be added to each session without extending the time required for the session. This may also yield 
interesting results since more measurements per interaction technique will be available. The magnification can 
also be excluded since it clearly does not yield better throughput, increased speed or fewer target re-entries or 
other factors. More sessions may provide deeper insight into the effects of learning, particularly on ETSG. 
Additionally, other selection means may be considered as a way to counteract the significantly slower time to 
selection of the speech commands. 
A more thorough examination of subjective satisfaction can also be made by requiring the device assessment 
to be completed for multiple interaction techniques. Furthermore, subjective satisfaction could be measured 
after the first exposure and then again after the final exposure in order to determine whether there is a shift in 
satisfaction after prolonged usage of the pointing devices. 
Additionally, the results obtained could be specific to the eye-tracker used and results of ETS could be 
significantly different if an eye-tracker with higher accuracy and precision was used. Similarly, the gravitational 
well could be rendered superfluous under these conditions. Further research can be conducted whereby 
different eye-trackers are compared with one another in this regard. 
 
5.12 Summary 
This chapter reported on the analysis of the ISO testing which was conducted with the mouse, eye gaze and 
speech, eye gaze and speech with a gravitational well and also with magnification. Overall it was found that 
the visual feedback does not impact on performance measures in any way. Furthermore, the mouse remained 
the most effective means of pointing and clicking. However, the use of a gravitational well significantly improved 
the performance measures which can be achieved with eye gaze and speech, particularly in terms of time to 
acquire the designated target. Overall, the use of a gravitational well appears to provide a promising means of 
increasing the performance of eye gaze and speech as a pointing device. Conversely, the use of a magnification 
tool appears to hamper the performance of eye gaze and speech significantly since it does not provide enough 
stability. It remains to be seen, however, whether a more extended use of the interaction techniques can 
further improve their performance or whether they will eventually reach a plateau of 
performance levels. The following chapter will report on the results of testing eye gaze and speech in 
Microsoft Word®. 
 
  
 
 
CHAPTER 6 
ANALYSIS OF SPEECH COMMANDS IN WORD 
 
6.1 Introduction 
The previous chapter explored the possibility of eye gaze and speech as a replacement for the mouse as a 
pointing device. It was found that when using a gravitational well, it was possible to achieve improved 
performance with eye gaze and speech. Additionally, the use of magnification did not enhance the use of eye 
gaze and speech as a pointing device. This chapter will focus on the use of a speech interaction technique 
within a word processing application. In particular, a number of tasks will be compared when using the 
traditional means of input in a word processor and when using speech as an alternative. Since the interaction 
techniques will be used in a well-known application, the environment will be familiar to the participants and 
the feasibility of using speech to complete tasks can be investigated thoroughly. A number of usability 
measurements will be identified and analysed and the results of the analysis will be discussed in detail. 
 
6.2 Procedure 
The longitudinal testing was conducted over a ten-week period. For the purposes of this thesis, longitudinal 
testing refers to the fact that a series of tests were repeated and conducted over a period of time. Each 
participant attended one session per week at the same time and on the same day of the week. There were 
isolated incidents where the participant could not attend his/her scheduled session for various reasons and 
he/she was then accommodated in another session, which was not too close to his/her next session and not 
too far from his/her previous session. Of course, as students are prone to do, some did not make alternative 
arrangements and simply did not attend some of their sessions. Therefore, there were some weeks where not 
all 25 students participated. The students were paid a cash incentive for each session that they attended. 
During the first session, participants completed the pre-test questionnaire (Appendix F). Following this, they 
each trained their speech profile using the Microsoft speech training wizard. The training wizard requires that 
the user read a large amount of pre-defined text. The wizard then attempts to recognise the spoken words and 
in so doing, build a profile for the reader based on their pronunciation and enunciation of the words. The 
participants were then introduced to the multimodal Word that they would be using for the next few weeks. 
They were also given a brief tutorial of the speech grammar which was available for use in Word (Table 3.1). 
The participants were then encouraged to interact with the application and to use all the verbal commands as 
well as attempting to type a full sentence using the onscreen keyboard and the interaction technique of eye 
gaze and speech. Once they were comfortable with the application, they were given an explanation regarding 
what the next few sessions would consist of. To conclude the first session, each participant completed the 
post-test questionnaire as shown in Appendix G. 
Every subsequent session followed the same procedure, namely to complete the tasks as set 
out in Section 3.4.3.3 (Table 3.5). Each individual task was displayed to the participant using a small window 
overlaid on the word processor. The window did not obstruct any part of the word processor. The participant 
had to complete as many tasks as they could manage in their half-hour slot. After completion of their tenth 
session, all participants completed the more comprehensive post-test questionnaire (Appendix H). Therefore, 
according to this setup each participant could complete a maximum of nine tests with the application. 
 
6.3 Participants 
In total there were 25 students who participated in the longitudinal study. They were all undergraduate 
students who were completing their studies at the University of the Free State. Therefore, the sampling 
technique used for this study was convenience sampling. A pre-requisite for participation in the study was 
sufficient computer literacy as well as word processor expertise. Forty percent of the sample was drawn from 
second year Computer Science students who were registered for a community service module. The other 60% 
of the sample was drawn from the student assistants for the computer literacy course of the university, with 
the proviso that they were not studying for a computer science or related degree. These students all had to 
complete the literacy course prior to becoming an assistant and they had to achieve at least 70% for a 
competency test of Microsoft Office applications.  
In order to determine a measurement of expertise with Microsoft Word®, the question pertaining to the 
duration that the participants had used Word and the frequency with which they used Word were each 
measured on a scale of 0-4. Then the responses to these two questions were multiplied to get a measurement 
on a scale of 0-16. This scale was then viewed as a measurement of expertise. The expertise rating of each 
participant was calculated and it was found that there were 3 participants with low Word expertise (scale 
rating from 0 until 6), 3 with average Word expertise (scale rating from 7 until 10) and 17 with high Word 
expertise (scale rating above 10). Since there were other measures in place to confirm their expertise with a 
word processor, namely their qualification to serve as assistants, all participants were accepted into the study 
as competent Word users. 
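
The expertise rating described above can be sketched as follows. The function names and the example answers are hypothetical; the scale boundaries follow the ranges given in the text.

```python
def expertise_score(duration, frequency):
    """Both answers are coded on a scale of 0-4; their product gives a
    score on a scale of 0-16."""
    if not (0 <= duration <= 4 and 0 <= frequency <= 4):
        raise ValueError("answers must be coded on a scale of 0-4")
    return duration * frequency


def expertise_level(score):
    """Low: 0 to 6, average: 7 to 10, high: above 10."""
    if score <= 6:
        return "low"
    if score <= 10:
        return "average"
    return "high"


# Hypothetical (duration, frequency) answers for three participants.
answers = [(1, 2), (3, 3), (4, 4)]
levels = [expertise_level(expertise_score(d, f)) for d, f in answers]
```

Because the score is a product, a participant who answers 0 to either question scores 0 regardless of the other answer.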
There were 17 male participants and 8 female participants and the average age of the participants was 21.1 
(standard deviation = 1.9). Six participants indicated that English was their first language, 7 Afrikaans and the 
remainder (12) were African language speakers. Since the university employs a parallel medium tuition policy 
where classes are offered in either English or Afrikaans, all students were comfortable in either English or 
Afrikaans. Each session was conducted in the tuition language in which the participant was most comfortable. 
Only four participants used keyboard shortcuts while working in Word and 17 preferred using the mouse 
rather than the keyboard to complete tasks. Only one participant, who was a Computer Science student, had 
had exposure to the eye-tracker but this was in the capacity of using it for research purposes. Therefore, while 
he was at ease with the technology he had not used it as an input technique. Five participants had used speech 
recognition before but only as a dictation tool. 
 
6.4 Tasks 
The task list (Table 3.5) had a total of 20 tasks, five of which were typing tasks (phrases to be typed were 
randomly chosen from the phrase set specified in section 3.4.3.3). Three of these typing tasks had to be 
completed using the onscreen keyboard with eye gaze and speech as an interaction technique. The majority of 
the other types of task, for example selection and formatting, had to be completed using the traditional means 
of a mouse or keyboard. A similar task then had to be repeated using speech recognition. The tasks were set 
up in such a way that tasks of the same type required approximately the same minimum number of actions to 
complete successfully. This task list remained the same for the first four sessions. Thereafter, an additional 5 
typing tasks (Table 6.1) were added to the end of the task list. These typing tasks all had to be completed using 
the onscreen keyboard. However, the size and spacing of the keys were adjusted in order to test the effect of 
different spacing and button sizes on typing. By the fourth session most participants were able to complete the 
original 20 tasks in less than their scheduled half an hour. Since participants were not pressured to complete 
all tasks but rather to complete as many as possible and as accurately as possible within their allotted time, the 
additional tasks placed no further pressure on them to complete more tasks. If their 30 minutes expired and 
they had not yet completed all tasks, they were simply allowed to finish the task they were busy with and then 
the test ended.  
The first additional typing task used the same settings as the original tasks. The next two used buttons that 
were 5 pixels smaller in both height and width, but were 10 pixels further apart. The final two tasks used 
buttons that were reduced a further 5 pixels in width and height but which were spaced the same as for the 
original typing tasks. These tasks were added in response to requests from participants that they be permitted 
to try smaller buttons for the typing tasks. The additional typing tasks were preceded by a new task that 
required the participant to remove all the text from the document. Consequently, the final five sessions had 26 
tasks. The setup of the typing tasks will be discussed in greater detail in the next chapter. 
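The three keyboard configurations described above can be expressed as offsets relative to the original layout. The following is a minimal sketch: the original button size and spacing are not given in this section, so the base values in the example are purely hypothetical, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyLayout:
    size_delta_px: int     # change in button width and height relative to the original
    spacing_delta_px: int  # change in the gap between buttons relative to the original


# Offsets per additional typing task, as described in the text.
LAYOUT_BY_TASK = {
    22: KeyLayout(0, 0),     # original settings
    23: KeyLayout(-5, 10),   # 5 px smaller, 10 px further apart
    24: KeyLayout(-5, 10),
    25: KeyLayout(-10, 0),   # a further 5 px smaller, original spacing
    26: KeyLayout(-10, 0),
}


def apply_layout(base_size_px: int, base_spacing_px: int, layout: KeyLayout):
    """Return (button size, spacing) in pixels for a given layout."""
    return (base_size_px + layout.size_delta_px,
            base_spacing_px + layout.spacing_delta_px)


# Hypothetical base values, for illustration only.
size, spacing = apply_layout(60, 10, LAYOUT_BY_TASK[23])
```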
The tasks could be grouped as follows: 
Table 6.1: Task description and grouping 
Task  Task text                                                                                     Task type 
22    Enter the following phrase using eye gaze and speech recognition: <randomly selected phrase>  Typing with original settings 
23    Enter the following phrase using eye gaze and speech recognition: <randomly selected phrase>  Typing with slightly smaller buttons but spaced further apart 
24    Enter the following phrase using eye gaze and speech recognition: <randomly selected phrase>  Typing with slightly smaller buttons but spaced further apart 
25    Enter the following phrase using eye gaze and speech recognition: <randomly selected phrase>  Typing with slightly smaller buttons but spaced the same as original buttons 
26    Enter the following phrase using eye gaze and speech recognition: <randomly selected phrase>  Typing with slightly smaller buttons but spaced the same as original buttons 
 
A more succinct summary of the tasks is tabulated below. Since the next chapter will focus on the typing tasks 
and this chapter only on the other tasks, the typing tasks will be omitted for the time being: 
 Table 6.2: Grouped tasks as divided between interaction techniques 
Task type Keyboard Speech 
Line selection and formatting 1 1 
Select all text and remove 1 2 
Select words and format 1 1 
Paste 1 1 
Undo 1 1 
Select word and copy 1 1 
Position and paste 1 1 
Select all and format  1 
 
This chapter will concentrate on analysing each of these task types individually for usability and learnability. 
The typing tasks will be analysed in the following chapter.  
 
6.5 Measurements 
As discussed and defined in section 2.3, the four pillars of usability which are of interest in this study are 
effectiveness, efficiency, learnability and satisfaction. Measurements of efficiency, effectiveness and 
learnability of the speech commands will be discussed in this chapter. The subjective satisfaction will be 
analysed and reported on in Chapter 8. 
The efficiency measurements that will be analysed are the time taken to complete the task and the number of 
actions that were required to complete the task. Additionally, the effectiveness measurement of the 
percentage of the task completed correctly by the participants will also be evaluated. The number of errors 
was also considered as a means to determine how effective the interaction technique is. However, since there 
are multiple ways to complete a task, it became very difficult to pinpoint exactly what was an erroneous 
action, particularly where the mouse or keyboard was used. For the speech, the commands that could 
complete the task could be isolated as an acceptable set of commands for that task and then any command 
issued that was not a member of that set could be flagged as an error command. However, since there is 
considerable risk for potentially flagging an action as an error when it might not be, it was decided that the 
number of actions to complete and the percentage of the task completed correctly were better indicators of 
the effectiveness of the interaction techniques.  
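The error-flagging approach that was considered for speech, and ultimately not adopted, amounts to a set-membership check. A minimal sketch follows; the commands "Select all" and "Remove" appear later in this chapter, while "Bold" and the function name are hypothetical and used for illustration only.

```python
def flag_error_commands(issued, acceptable):
    """Return the issued commands that are not in the acceptable set for the task."""
    allowed = {command.lower() for command in acceptable}
    return [command for command in issued if command.lower() not in allowed]


# Acceptable set for a "select all text and remove" task.
acceptable_commands = {"Select all", "Remove"}

# "Bold" is a hypothetical command that does not belong to this task.
errors = flag_error_commands(["Select all", "Bold", "Remove"], acceptable_commands)
```

The sketch also shows the weakness noted above: any command outside the predefined set is flagged, even if it forms part of a valid alternative route to completing the task.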
 
6.6 Limitations of this study 
A limitation of the study was the sensitivity of the speech recognition which was prone to pick up ambient 
noise and react to it. This meant that although the participants completed the task correctly, the result was 
not always as expected. For example, noises picked up by the microphone might cause the cursor to move to 
an unexpected position, resulting in erroneous input. However, participants soon learnt that this was the case 
and learnt to compensate for these shortcomings somewhat. This limitation also has the associated advantage 
of emulating a real-world environment and was therefore not considered to be a detriment to the study.  
The fact that the testing was conducted in a controlled environment could also be considered a limitation as 
testing was not conducted in a variety of possible environments in which it could be used. 
Furthermore, there was only a subset of commands catered for in the grammar. Since the aim of the study was 
to test speech recognition to provide a more hands-free environment, only the basic tasks were provided for in 
the grammar. It should not be a problem to extend the grammar to include more commands. The only 
shortcoming this created was that the grammar may have been easier to memorise than a longer one. Since 
there were only ten weeks in which the participants could learn the grammar, this was not felt to be a major 
drawback as most participants had been using Word in the traditional sense for a number of years already. 
 
6.7 Task analysis 
 
6.7.1 Line selection and formatting 
These tasks required that the participant select three consecutive lines of text and then apply formatting. The 
minimum number of actions required for both of these tasks was 4.  
 
6.7.1.1 Time to complete task 
The time to complete the task was measured from when the task was started to when the task was considered 
by the participant to be completed. This time included the time it took the participant to read the description 
of the task. Since similar tasks had virtually identical wording it was assumed that they would require the same 
amount of time to read and that, therefore, the time to read would not have an effect on the time required to 
complete the task.  
Time to complete the task was measured in seconds and Table 6.3 summarises the number of participants 
(first line of each row), the mean seconds required to complete the task (second line of each row) and 
standard deviation (third line of each row) of the time required to complete the selection and formatting tasks. 
Table 6.3: Descriptive statistics for time to complete line selection and formatting 
                  All participants      Participants completing all sessions 
                  Speech    Keyboard    Speech    Keyboard 
Session 2    n    25        25          12        13 
             x̄    76.2      37.3        84.2      40.4 
             s    44.7      17.7        51.8      20.7 
Session 3    n    23        23          12        13 
             x̄    36.9      25.5        32.0      26.2 
             s    20.7      8.0         19.2      7.4 
Session 4    n    24        24          12        13 
             x̄    34.9      30.5        31.7      32.4 
             s    18.4      26.2        17.0      31.6 
Session 5    n    23        23          12        13 
             x̄    29.1      27.0        25.1      28.6 
             s    21.0      23.6        21.3      28.6 
Session 6    n    23        23          12        13 
             x̄    26.8      24.5        18.0      22.3 
             s    22.6      14.1        3.2       9.2 
Session 7    n    22        22          12        13 
             x̄    23.6      21.1        19.1      22.5 
             s    17.0      9.7         9.1       9.9 
Session 8    n    20        20          12        13 
             x̄    16.5      19.3        16.3      20.9 
             s    4.4       8.4         4.6       9.5 
Session 9    n    22        22          12        13 
             x̄    27.6      18.1        22.2      16.3 
             s    19.9      7.0         10.8      6.4 
Session 10   n    24        24          12        13 
             x̄    21.0      17.7        16.6      17.2 
             s    13.5      8.2         4.9       7.9 
 
From Table 6.3 it can be seen that the times required to complete the two tasks are more or less the same 
over the majority of the sessions. It was only during the second session that the time for the speech task was 
much higher than the keyboard task. This could easily be attributed to the fact that the participants had not 
yet mastered the speech recognition and still had to consult the handout to determine which commands were 
needed. In the sessions thereafter, the recall of the commands could have been easier. In the eighth session, the 
speech task was, on average, completed even faster than the keyboard task. With the exception of the ninth 
and tenth sessions for speech and the fourth session for the keyboard, the times for both tasks steadily decreased. 
The fact that the time generally decreased over the sessions indicates that the commands are learnable and 
memorable, as more exposure to the commands facilitated quicker completion of tasks. 
Chart 6.1 plots the means for both interaction techniques over all sessions. The vertical bars denote a 95% 
confidence interval. 
The time measurements were in seconds and there were a vast number of instances in which the normality 
tests failed for the data, often for more than one test on that range of data. In order to combat this, the time 
measurement was converted to 1/time and normality tests were again conducted on this data. Although 
1/time will be used for the analysis of the time, the descriptive statistics and charts will be based on the 
original time data for the sake of clarity. This will apply to all the time analyses in this chapter. The results for 
the normality tests are summarised in Table 6.4. 
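The effect of the 1/time conversion can be illustrated with a small sketch. The completion times below are invented for illustration and are not study data, and sample skewness is used only as a rough stand-in for the Shapiro-Wilk and Kolmogorov-Smirnov tests that were actually applied.

```python
from statistics import mean, pstdev


def reciprocal(times):
    """Convert completion times in seconds to 1/time."""
    return [1.0 / t for t in times]


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    centre = mean(values)
    spread = pstdev(values)
    return sum(((v - centre) / spread) ** 3 for v in values) / len(values)


# Invented, right-skewed completion times (a few very slow attempts).
times = [12.0, 15.0, 16.0, 18.0, 20.0, 22.0, 25.0, 30.0, 45.0, 90.0]

raw_skew = skewness(times)
transformed_skew = skewness(reciprocal(times))
```

For right-skewed data such as these, the reciprocal pulls in the long upper tail, so the transformed values are considerably more symmetric than the raw times.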
 
 
Chart 6.1: Means for completion time of line selection and formatting 
 
Table 6.4: Normality test results from completion time of line selection and formatting 
             Shapiro-Wilk           Kolmogorov-Smirnov 
Session 2    W = 0.887, p < 0.05    d = 0.148, p > 0.05 
Session 3    W = 0.921, p < 0.05    d = 0.111, p > 0.05 
Session 4    W = 0.977, p > 0.05    d = 0.089, p > 0.05 
Session 5    W = 0.968, p > 0.05    d = 0.094, p > 0.05 
Session 6    W = 0.940, p > 0.05    d = 0.124, p > 0.05 
Session 7    W = 0.940, p > 0.05    d = 0.124, p > 0.05 
Session 8    W = 0.973, p > 0.05    d = 0.111, p > 0.05 
Session 9    W = 0.896, p > 0.05    d = 0.113, p > 0.05 
Session 10   W = 0.895, p > 0.05    d = 0.110, p > 0.05 
 
According to the Shapiro-Wilk test, the time measurements of sessions 2 and 3 are not normally distributed. 
Owing to the robustness of the ANOVA when the data is not normally distributed, the analysis nevertheless 
proceeded with the ANOVA. All subsequent analyses will always include the tests for normality. For the sake of 
conciseness, however, the individual results will not be reported.  
The assumption of sphericity was also violated (χ²(35) = 68.969, p < 0.05), therefore the adjusted corrections 
(StatSoft, 2010) will also be reported. The following hypotheses were tested for this analysis: 
1. H0,1: There is no difference between the time required to complete the tasks when using the mouse 
and keyboard or speech commands. 
2. H0,2: The practice obtained over the sessions has no effect on the time taken to complete the tasks. 
The repeated measures ANOVA did not reject the first null hypothesis at an α-level of 0.05 
(F(1, 23) = 0.286, p > 0.05). Therefore, no significant difference was found between the time 
required to complete line selection and formatting when using the different interaction techniques. In terms of 
completion time, speech commands were thus comparable to the mouse and keyboard. Moreover, this indicates that 
the use of speech commands is a viable alternative to the mouse and keyboard, in terms of time required. 
H0,2 could be rejected (F(8, 184) = 14.040, p < 0.05) which indicates that practice significantly affects the time 
required to complete the task. The interaction between the two factors of session and interaction technique 
was not significant (F(8, 184) = 1.722, p > 0.05). The adjusted univariate results as well as the 
multivariate test results are shown in the table below: 
Table 6.5: ANOVA results for the completion time of line selection and formatting 
                                  Geisser-Greenhouse                 Huynh-Feldt                         Multivariate 
Session                           F(4.0, 92.2) = 14.040, p < 0.05    F(5.1, 118.9) = 14.040, p < 0.05    F(8, 16) = 13.292, p < 0.05 
Session × Interaction technique   F(1, 92.2) = 1.722, p > 0.05       F(5.1, 118.9) = 1.722, p > 0.05     F(8, 16) = 1.255, p > 0.05 
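The adjusted degrees of freedom in Table 6.5 follow from multiplying the unadjusted degrees of freedom by a correction factor ε, while the F statistic itself is unchanged. The sketch below shows this for the Geisser-Greenhouse column. The value of ε is not reported in the text; here it is back-calculated from the reported error degrees of freedom (92.2 out of 184) and should be read as an assumption.

```python
def adjusted_df(df_effect: float, df_error: float, epsilon: float):
    """Scale both degrees of freedom by the sphericity correction factor."""
    return epsilon * df_effect, epsilon * df_error


# Unadjusted test for the session effect: F(8, 184).
# Epsilon is inferred from the reported adjusted error df (assumption).
epsilon_gg = 92.2 / 184

df_effect_gg, df_error_gg = adjusted_df(8, 184, epsilon_gg)
```

A factor of roughly 0.5 halves both degrees of freedom, which makes the critical F value larger and the test more conservative.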
  
Tukey’s HSD was used for post-hoc analysis to establish which sessions were responsible for the significant 
difference. Session 2 differed significantly from all other sessions. Session 3 differed significantly from session 
6 as well as from sessions 8, 9 and 10. Session 4 differed significantly from sessions 8, 9 and 10. Since session 2 
was actually the first session where the tasks had to be completed, the reason for the longer time could be 
that the participants were not familiar with the tasks and the process of the task completion. As the 
participants became accustomed to the session requirements, the time lessened. Therefore, the improvement 
from the second to the third session and the subsequent performance measured was significant. The 
improvement between sessions 3 and 4 and the last sessions was also significant. This indicates a significant 
level of learning to use the system such that performance is comparable to traditional interaction techniques. 
 
6.7.1.2 Number of actions  
The next measurement to be analysed was the number of actions that were performed during task 
completion. Actions are defined as any mouse click, button press or speech command that was issued during 
completion of the task. The number of actions was measured per interaction technique and per session for 
each participant and then, as for all other analyses, outliers were removed from the data set prior to analysis 
(Section 3.5).  
The table below shows the number of participants whose data was included in the analysis, with the mean of 
the data in the second line of each row and the standard deviation in the third line. 
The descriptive statistics show that the number of actions for the speech interaction technique was very high 
in session 2, the first measured session, and then declined sharply in session 3. Thereafter, it decreased steadily until 
session 8 after which it stabilised somewhat around an average of 10. The actions for the keyboard remained 
between a minimum of 10 and maximum of 15 for the majority of the sessions. Apart from the first session 
and session 8, the actions appear to be comparable for the two interaction techniques. Chart 6.2 is a plot of 
the mean number of actions over all the sessions. 
  
Table 6.6: Descriptive statistics for the number of actions used for line selection and formatting 
                  All participants      Participants completing all sessions 
                  Speech    Keyboard    Speech    Keyboard 
Session 2    n    25        23          13        10 
             x̄    32.4      11.7        34.9      16.5 
             s    23.0      11.4        25.5      15.9 
Session 3    n    23        22          13        10 
             x̄    12.3      8.5         8.5       8.1 
             s    8.5       9.9         5.7       8.4 
Session 4    n    24        20          13        10 
             x̄    14.4      14.1        12.5      20.0 
             s    9.6       14.7        7.9       16.0 
Session 5    n    23        22          13        10 
             x̄    12.5      13.9        11.2      14.0 
             s    9.6       15.7        9.9       14.7 
Session 6    n    24        23          13        10 
             x̄    10.2      10.9        7.8       12.7 
             s    6.2       9.4         2.0       12.4 
Session 7    n    23        22          13        10 
             x̄    9.7       11.9        8.1       12.4 
             s    7.1       12.4        3.3       11.9 
Session 8    n    20        19          13        10 
             x̄    7.7       15.1        8.1       15.2 
             s    2.6       14.6        2.8       12.9 
Session 9    n    21        21          13        10 
             x̄    10.8      9.1         10.5      11.3 
             s    4.3       7.2         4.0       8.7 
Session 10   n    23        23          13        10 
             x̄    9.2       12.5        7.8       10.9 
             s    3.8       10.5        2.1       8.4 
 
The assumption of sphericity was also not met (χ²(35) = 137.094, p < 0.05), which requires that adjusted 
corrections be made to the degrees of freedom using the Geisser-Greenhouse and Huynh-Feldt corrections (StatSoft, 
2010). The following hypotheses were formulated: 
1. H0,1: The interaction technique has no effect on the number of actions required to complete the task. 
2. H0,2: There is no difference between the number of actions required to complete the tasks over the 
various sessions. 
 
 
Chart 6.2: Mean number of actions required to perform line selection and formatting 
At an α-level of 0.05, there is significant interaction between the two factors (F(8, 164) = 4.105, p < 0.05) which 
means that the results of the overall ANOVA cannot be interpreted easily, but that separate ANOVAs must be 
conducted so as to control for each factor individually. Consequently nine separate ANOVAs were performed 
to determine if the interaction technique had a significant effect on the number of actions required to 
complete the task. These results are tabulated below: 
Table 6.7: Results of ANOVA on the number of actions required to perform line selection and formatting 
 ANOVA  
Session 2 F(1, 46) = 15.260, p < 0.05 
Session 3 F(1, 43) = 1.918, p > 0.05 
Session 4 F(1, 42) = 0.008, p > 0.05 
Session 5 F(1, 43) = 0.138, p > 0.05 
Session 6 F(1, 45) = 0.105, p > 0.05 
Session 7 F(1, 43) = 0.540, p > 0.05 
Session 8 F(1, 37) = 4.992, p < 0.05 
Session 9 F(1, 40) = 0.792, p > 0.05 
Session 10 F(1, 44) = 2.006, p > 0.05 
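As a rough check on how a per-session result such as F(1, 46) for session 2 arises, the F statistic of a one-way ANOVA can be computed from group sizes, means and standard deviations alone. The sketch below uses the "All participants" values for session 2 from Table 6.6 and assumes that the per-session test treated the two techniques as independent groups, which is consistent with the reported degrees of freedom but is not stated explicitly. Because the table values are rounded, the result should be close to, but not identical with, the reported statistic.

```python
def one_way_f(groups):
    """Compute F and its degrees of freedom from (n, mean, sd) summaries."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Session 2, number of actions: (n, mean, standard deviation).
speech = (25, 32.4, 23.0)
keyboard = (23, 11.7, 11.4)

f_value, df_between, df_within = one_way_f([speech, keyboard])
```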
 
Consequently, H0,1 could only be rejected for sessions 2 and 8. During session 2, the keyboard required 
significantly fewer actions to complete the task and in session 8 it required significantly more actions than the 
speech interaction technique. Session 2 was the first session and could be viewed as a learning experience for 
the participants to become suitably accustomed to the speech commands available for use. After this session, 
the actions were reduced to such a degree that the same number of actions could be used as when using the 
keyboard. Combining this with the findings on the time required to complete the task, it could be concluded 
that speech and the keyboard are equivalent in terms of efficiency and effectiveness when selecting multiple 
lines and applying formatting to those lines, and that speech even has the potential to surpass the keyboard on 
some usability measurements. 
A repeated-measures ANOVA was then conducted to analyse H0,2 for each interaction technique. The 
hypothesis could not be rejected for the keyboard but could be rejected for the speech commands. Tukey’s 
post-hoc analysis was used to determine which sessions differed significantly. It was found that session 2 
differed significantly from all other sessions for speech. During session 2, participants required more actions, 
on average, than in the other sessions. This could indicate that participants were simply unfamiliar with the 
test structure, particularly since these were the first two tasks in the test. At this stage, participants were also 
unsure of the verbal commands and may have required some time to familiarise themselves with the available 
commands. The fact that the number of actions decreased over the sessions showed that the participants 
were able to learn, and use, the commands to select lines and format more effectively as time went by. 
Since the key presses were, on average, marginally more than the speech commands, the actual keys that were 
pressed (which were captured and stored in real time during completion of the test) were examined more 
closely. It was found that for each session the [Right] and [Down] keys were used a large number of times. The 
[End] key was only used during three sessions: by a single participant during two of these sessions and by 
two participants in the other session. This indicates that the majority of the participants either were not aware 
of the shortcut [Control + End] to move to the end of a document or preferred to navigate there using multiple 
key presses or the mouse. The [Right] key was pressed by far the greatest number of times, which implies that 
some of the participants selected the lines character by character by holding down the [Right] key. Depending on 
the amount of text to be selected, this is by far the most inefficient method of selecting text, particularly when 
whole words or lines must be selected. Since it appears that the majority of the participants used the mouse 
for selection purposes, the fact that a minority employed this very inefficient means was not 
cause for great concern, but cognisance was taken thereof. 
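The kind of key-level inspection described above can be sketched as a simple frequency count over a captured key log. The log format used here (a list of key names in the order pressed) is an assumption made for illustration; the format in which the study's software stored key presses is not described in this section.

```python
from collections import Counter


def key_press_counts(key_log):
    """Count how often each key appears in a captured key log."""
    return Counter(key_log)


# Hypothetical log for one participant and one task.
sample_log = ["Down", "Right", "Right", "Right", "Right", "End", "Down"]

counts = key_press_counts(sample_log)
most_used_key, presses = counts.most_common(1)[0]
```

A dominant count for [Right] relative to shortcut keys such as [End] would point to character-by-character navigation or selection.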
 
6.7.1.3 Correctness of task completion 
Each of these tasks featured three distinct components which had to be performed in order for the task to be 
completed correctly, namely: 
1. The participant had to select a portion of text. 
2. The correct text had to be selected. For example, some tasks required the first two words to be 
selected and formatted. If the participant only applied formatting to the first word, then they would 
receive credit for the fact that a selection occurred (number 1) but not for this step as the formatting 
was not applied correctly according to the task specifications. 
3. The correct formatting had to be applied. 
Participants received credit for each component of the task that they completed correctly. Therefore, a 
participant who attempted a task could score a minimum of 0 and a maximum of 3 for that task. The 
stacked bar graph below shows the number of participants who scored 0, 1, 2 and 3 for each of the tasks and 
for each session. 
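The scoring scheme can be summarised as one point per correctly completed component. The following is a minimal sketch of that rubric; the function and parameter names are hypothetical.

```python
def correctness_score(selection_made: bool,
                      correct_text_selected: bool,
                      correct_formatting_applied: bool) -> int:
    """Award one point for each of the three task components."""
    return sum([selection_made, correct_text_selected, correct_formatting_applied])


# A participant who selected the wrong lines but applied the right formatting.
score = correctness_score(True, False, True)
```

Under this rubric, selecting the wrong portion of text while applying the correct formatting yields a score of 2.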
The high incidence of a zero count in a large proportion of the categories prevents a meaningful analysis from 
being performed; therefore no statistical inferences will be made from the data. The keyboard had a lower 
task completion rate than the speech. The fact that the keyboard task required participants to select the last 
three lines in the document and the speech task the first three lines was the cause of this difference. All the 
participants who scored 2 for this task lost a mark for selecting the incorrect lines. However, observation 
during task completion showed that the participants did in fact select three lines during the completion of the 
task, they just selected the first three lines of text in the document. Therefore, the lower scores for the 
keyboard are caused, not by the interaction technique per se, but rather by the participants not reading the 
task instructions correctly. If points were awarded less strictly and based on whether any three lines of text 
were selected, all participants with a current score of 2 would score 3 for the task. Consequently, the 
correctness of the two tasks would be identical. This leads to the conclusion that, in terms of the interaction 
technique, there is no difference between the keyboard and speech command with respect to the correctness 
with which the task can be completed. 
 
[Stacked bar chart. Vertical axis: number of participants (0 to 25). Horizontal axis: speech and keyboard tasks for Session 2 to Session 10. Legend: score of 0, 1, 2 or 3.] 
 
Chart 6.3: Correctness of task - Select lines and format 
Also note that the document that was provided to the participants to work in had only eight lines of text and 
fitted comfortably within the viewing area even when the onscreen keyboard was activated. Therefore, the 
fact that the tasks required lines of text to be selected at the start of the document and then at the end of the 
document was considered trivial in terms of the possible impact it would have on task completion. 
 
6.7.2 Select all text and remove 
There was a single task which required the participants to select all text and remove it from the document 
when using the keyboard. There were two such tasks to be completed using speech commands. However, the 
second task using speech recognition was simply in place so that the participants could complete the typing 
tasks in a clean document. Therefore, the instruction to remove the text for this task using speech was not 
strictly enforced and for this reason the second of these tasks using speech recognition will not be included in 
the current analysis. 
The keyboard task required the participants to select all the text in the document and to cut it. In contrast, 
the speech task required that all text must be selected and deleted from the document, which would require 
the commands “Select all” followed by “Remove”. This means that the keyboard task required one more key 
press to complete the task successfully, namely [Ctrl + A] and then [Ctrl + X]. Nevertheless, the end result of 
the tasks is the same and the single extra action should not complicate the use of the keyboard to such an 
extent that the difference will be significant as a result thereof. Since the end result is the same the two tasks 
both fall under the same category of document text selection and removal. 
 
6.7.2.1 Time to complete task 
The time to complete each task was measured in seconds for each session and each participant. The 
assumption of sphericity (χ²(35) = 47.024, p > 0.05) was met, therefore the data was suitable to be analysed 
using a repeated-measures within-subjects ANOVA. Descriptive statistics of the data are summarised in Table 
6.8, with the first line giving the number of participants who completed the task (outliers have been removed), 
the second the mean and the third the standard deviation. 
Inspection of the means shows that both interaction techniques started with a fairly low completion time and 
then continued to decrease over the following sessions. Eventually, the speech interaction technique could be 
used to complete the task faster than with the keyboard and mouse. Chart 6.4 below gives a visual 
representation of the means for both interaction techniques over all sessions. 
The following hypotheses were formulated: 
1. H0,1: The interaction technique has no effect on the time required to complete the task. 
2. H0,2: The session in which the task was completed has no effect on the time required to complete the 
task. 
Table 6.8: Descriptive statistics for completion time of removing all selected text 
                  All participants      Participants completing all sessions 
                  Speech    Keyboard    Speech    Keyboard 
Session 2    n    25        25          12        13 
             x̄    28.2      25.7        24.0      20.3 
             s    23.0      14.3        20.1      6.9 
Session 3    n    23        23          12        13 
             x̄    17.4      17.0        16.1      14.9 
             s    10.4      7.1         9.9       5.5 
Session 4    n    24        24          12        13 
             x̄    13.2      16.8        12.4      13.2 
             s    6.1       14.1        5.5       4.7 
Session 5    n    23        23          12        13 
             x̄    13.4      14.7        10.8      12.6 
             s    11.1      7.6         9.4       5.1 
Session 6    n    23        24          12        13 
             x̄    13.5      12.4        10.6      11.7 
             s    11.9      5.1         9.7       4.5 
Session 7    n    21        22          12        13 
             x̄    7.5       12.8        8.0       11.9 
             s    2.7       7.3         3.5       8.1 
Session 8    n    20        20          12        13 
             x̄    10.2      13.1        8.0       11.7 
             s    9.7       8.5         2.8       9.4 
Session 9    n    22        22          12        13 
             x̄    8.6       12.0        7.7       12.2 
             s    4.3       7.3         1.9       8.7 
Session 10   n    24        24          12        13 
             x̄    7.3       12.6        6.5       10.9 
             s    2.4       7.6         1.8       5.6 
 
H0,1 could be rejected at a significance level of 0.05 (F(1, 23) = 4.328, p < 0.05) and therefore it can be 
concluded that the interaction technique does have a significant effect on the time required to complete the 
task. When using the keyboard or mouse for this task, participants took significantly longer to complete the 
task in the majority of the sessions than when they made use of speech commands. This is evidence that 
speech can make the selection and removal of text more efficient than the keyboard or mouse. 
Similarly, H0,2 could be rejected at an α-level of 0.05 (F(8, 184) = 15.197, p < 0.05), which indicates the session 
has a significant effect on the time taken to complete the task. Tukey’s HSD test was used to determine which 
sessions differed significantly. It was found that session 2 differed significantly from sessions 5 to 10. Session 3 
differed significantly from sessions 6 to 10 and session 4 from sessions 7 to 10. Sessions 4 and 5 differed 
significantly from session 10. Similar patterns held for both interaction techniques. Since the later sessions all 
had an average completion time less than the earlier sessions, it could be said that the first sessions took 
significantly longer than the later sessions. This would indicate some measure of learning in using the 
interaction techniques and perhaps the application as a whole. 
 
 
Chart 6.4: Mean plot for completion time of removing all selected text 
 
6.7.2.2 Number of actions 
The number of actions per task was measured for each participant and each session. Table 6.9 shows the 
descriptive statistics for the data. The first row shows the number of participants who were included in the 
analysis, the second row is the mean and the third the standard deviation. Chart 6.5 is a graphical 
representation of the mean number of actions. 
From Table 6.9 it is clear that the keyboard and mouse task required more actions, on average, than the 
speech commands. This could indicate a higher level of efficiency for the speech interaction technique, 
particularly in view of the significant difference in task completion times. To determine whether these 
differences are significant, the following hypotheses were proposed: 
1. H0,1: The interaction technique does not significantly affect the number of actions required to 
complete the task. 
2. H0,2: The number of actions required to complete the tasks does not differ significantly between 
sessions. 
H0,1 could be rejected at an α-level of 0.05 (F(1, 18) = 8.574, p < 0.05), leading to the conclusion that the 
interaction technique had a significant effect on the number of actions required to complete the task. More 
specifically, the speech interaction technique requires significantly fewer actions to complete the task than 
when a keyboard and mouse are used. H0,2 could also be rejected (F(8, 144) = 2.562, p < 0.05) indicating that 
there was a noticeable change in the number of actions required as exposure to the application was
increased. Since the data did not meet the assumption of sphericity (χ²(35) = 106.449, p < 0.05), to conclude
the analysis, the adjusted corrections and multivariate results are summarised in Table 6.10.
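To illustrate what an adjusted correction does, the following sketch estimates the Geisser-Greenhouse epsilon from a participants × sessions table and scales the degrees of freedom by it. It is a simplified illustration for a single group of participants with hypothetical data, not the procedure of the statistical package used in the study. Read this way, the corrected degrees of freedom in Table 6.10 (3.5 and 62.1 instead of 8 and 144) correspond to an epsilon of roughly 0.43.

```python
from statistics import mean

def greenhouse_geisser_epsilon(data):
    """Estimate epsilon for a table with one row per participant, one value per session.

    The result lies between 1 / (k - 1) (maximal violation) and 1 (sphericity holds).
    """
    n, k = len(data), len(data[0])
    col_means = [mean(row[j] for row in data) for j in range(k)]
    # Sample covariance matrix of the k sessions.
    cov = [[sum((row[i] - col_means[i]) * (row[j] - col_means[j]) for row in data) / (n - 1)
            for j in range(k)] for i in range(k)]
    # Double-centre the covariance matrix (remove row, column and grand means).
    row_avg = [mean(cov[i]) for i in range(k)]
    grand = mean(row_avg)
    centred = [[cov[i][j] - row_avg[i] - row_avg[j] + grand for j in range(k)]
               for i in range(k)]
    trace = sum(centred[i][i] for i in range(k))
    sum_sq = sum(v * v for row in centred for v in row)
    return trace ** 2 / ((k - 1) * sum_sq)

def corrected_df(epsilon, k, n):
    """Scale the uncorrected degrees of freedom (k - 1) and (k - 1)(n - 1) by epsilon."""
    return epsilon * (k - 1), epsilon * (k - 1) * (n - 1)

# Hypothetical action counts: 5 participants x 4 sessions.
counts = [
    [12, 6, 5, 4],
    [9, 7, 4, 3],
    [20, 8, 6, 3],
    [7, 5, 5, 4],
    [15, 9, 4, 2],
]
eps = greenhouse_geisser_epsilon(counts)
df1, df2 = corrected_df(eps, k=4, n=5)
```

The F value itself is unchanged by the correction; only the degrees of freedom against which it is evaluated become smaller, which makes the test more conservative.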
 
Table 6.9: Descriptive statistics for the number of actions required to remove all selected text
                All participants    Participants completing all sessions
                Speech    Keyboard  Speech    Keyboard
Session 2   n   24        25        12        8
            x̄   9.2       11.8      7.4       6.4
            s   9.6       10.6      7.9       3.3
Session 3   n   23        22        12        8
            x̄   5.2       6.7       4.2       5.8
            s   4.5       3.2       3.1       3.1
Session 4   n   24        22        12        8
            x̄   4.8       8.2       4.7       10.8
            s   3.5       6.7       3.3       7.1
Session 5   n   23        22        12        8
            x̄   5.3       8.9       4.5       8.8
            s   4.9       7.4       4.7       7.3
Session 6   n   22        23        12        8
            x̄   3.6       6.3       2.9       7.1
            s   2.2       3.8       2.3       4.3
Session 7   n   21        20        12        8
            x̄   3.3       8.4       3.3       4.9
            s   1.2       7.6       1.4       2.4
Session 8   n   20        20        12        8
            x̄   4.2       8.8       3.8       7.9
            s   2.3       8.3       1.7       8.8
Session 9   n   22        19        12        8
            x̄   3.2       5.0       2.9       4.0
            s   1.7       3.3       0.9       1.6
Session 10  n   24        22        12        8
            x̄   3.3       5.9       3.1       5.0
            s   1.3       3.3       1.2       2.1
 
 
Chart 6.5: Mean plot for the number of actions required to remove all selected text 
 
Table 6.10: Analysis results for the number of actions required to remove all selected text
                                 Geisser-Greenhouse              Huynh-Feldt                     Multivariate
Session                          F(3.5, 62.1) = 2.562, p < 0.05  F(4.6, 82.9) = 2.562, p < 0.05  F(8, 11) = 3.379, p < 0.05
Interaction technique × Session  F(3.5, 62.1) = 1.441, p > 0.05  F(4.6, 82.9) = 1.441, p > 0.05  F(8, 11) = 1.579, p > 0.05
  
Once again, closer analysis of the actions for the keyboard task showed a high incidence of key presses for the 
[Right] arrow key during all sessions.  Very few key presses were registered on the [A] key even though the 
combination of [Control + A] will select all the text in the document. This again shows that the participants do 
not use the keyboard shortcuts which are in place to simplify and speed up the use of the application or that 
they are not aware of the shortcuts which can be used. The fact that the participants appeared to select the 
text one character at a time could explain the high number of actions for this task. It would be interesting to
conduct a study in which participants are first coached in the proper use of the shortcuts, after which the times
and actions needed to complete tasks using the keyboard and speech are measured. This would allow a more
conclusive analysis of the difference between the interaction techniques when the shortest method is used
for both tasks.
Tukey’s HSD did not highlight any significant differences between sessions but the less conservative Fisher’s 
LSD did. Session 2 differed significantly from sessions 7, 9 and 10 and session 4 differed significantly from 
sessions 6, 7, 9 and 10. Sessions 2 and 4 had the highest average number of actions required to complete the 
tasks which could signify that for some reason the participants struggled more that week than they did in the 
other weeks. Session 2 could be attributed to the first-time use of the application. Apart from sessions 4, 5, 
and 8 all other sessions showed an improvement in or similar performance to the previous session.  
The fact that speech allowed this task to be completed in less time and with fewer actions is promising, as it
implies that there are circumstances under which the use of speech could be more efficient than the traditional
means of interaction.
 
6.7.2.3 Correctness of task completion 
The three components of this task that had to be completed correctly were: 
1. The participant had to select a portion of text. 
2. All the text in the document had to be selected.  
3. The selected text had to be removed. 
The chart below gives a stacked bar chart for the number of participants who scored 0, 1, 2 or 3 for either task 
and for all the sessions. 
[Stacked bar chart: number of participants (vertical axis) with a score of 0, 1, 2 or 3, shown for the speech and keyboard tasks in each of sessions 2 to 10]
 
Chart 6.6: Correctness of task - Select all text and remove 
Once again, the high incidence of no observations in many categories prevented a meaningful statistical 
analysis from being performed. However, it can clearly be seen from Chart 6.6 that the vast majority of the 
participants were able to complete the task 100% correctly from the very first session. There are only isolated 
incidents where this was not the case and it is doubtful that these will cause a significant difference between 
either the interaction techniques or the sessions. Therefore, it can be concluded that the correctness with 
which the task is completed is not affected by either the interaction technique or the session in which the task 
is completed. 
 
6.7.3 Select words and format
This task required the participants to select the first two words of the current line that they were on and make 
them bold. The task had to be completed once with the speech interaction technique and once with the 
keyboard. 
 
6.7.3.1 Time to complete the task 
The number of observations, the mean and standard deviation of the completion times for each session are 
shown in the first, second and third line of each row respectively in Table 6.11. 
This is the first task where the speech task has a higher average completion time than the
corresponding keyboard task. However, it is still possible that this higher completion time does not differ
significantly from that of the keyboard. Chart 6.7 provides a plot of the mean for the interaction
techniques across all sessions.
  
Table 6.11: Descriptive statistics for the completion time of formatting selected words
                All participants    Participants completing all sessions
                Speech    Keyboard  Speech    Keyboard
Session 2   n   25        25        14        14
            x̄   49.4      29.3      50.1      29.7
            s   25.8      23.5      30.2      27.3
Session 3   n   23        22        14        14
            x̄   41.8      21.5      42.5      19.3
            s   28.8      13.4      30.2      10.3
Session 4   n   24        23        14        14
            x̄   31.5      17.6      27.5      16.2
            s   16.2      6.9       14.2      7.0
Session 5   n   23        22        14        14
            x̄   23.9      19.3      21.1      17.1
            s   10.4      7.7       11.7      7.1
Session 6   n   23        24        14        14
            x̄   24.9      15.1      25.1      14.4
            s   12.8      3.8       13.6      3.8
Session 7   n   21        21        14        14
            x̄   25.4      17.4      21.2      16.0
            s   15.5      8.5       11.2      7.7
Session 8   n   20        20        14        14
            x̄   21.2      14.0      18.3      12.5
            s   12.5      5.8       10.7      3.8
Session 9   n   22        22        14        14
            x̄   25.0      14.5      20.4      13.3
            s   16.1      6.6       13.5      5.9
Session 10  n   24        24        14        14
            x̄   28.1      15.6      22.9      13.5
            s   15.8      6.4       15.0      4.9
 
 
Chart 6.7: Mean plot for completion times of formatting selected words 
 
The data did not meet the assumption of sphericity (χ²(35) = 53.048, p < 0.05), so an adjusted univariate
analysis also had to be performed in order to analyse the following hypotheses:
1. H0,1: The interaction technique has no effect on the time taken to complete the task. 
2. H0,2: There is no difference between the time taken to complete the task between the different 
sessions. 
These results, together with the general ANOVA and the multivariate results, are summarised in Table 6.12. 
Table 6.12: Analysis results for the completion times of formatting selected text
                                 ANOVA                         Geisser-Greenhouse               Huynh-Feldt                      Multivariate
Interaction technique            F(1, 26) = 10.447, p < 0.05
Session                          F(8, 208) = 9.487, p < 0.05   F(4.9, 126.4) = 9.487, p < 0.05  F(6.3, 165.1) = 9.487, p < 0.05  F(8, 19) = 5.707, p < 0.05
Interaction technique × Session  F(8, 208) = 0.669, p > 0.05   F(4.9, 126.4) = 0.669, p > 0.05  F(6.3, 165.0) = 0.669, p > 0.05  F(8, 19) = 0.952, p > 0.05
 
The null hypothesis, H0,1, was rejected at an α-level of 0.05. This means that the interaction technique has a 
significant effect on the time taken to complete the task. Since the time to complete the task using speech 
was, on average, higher for all sessions it can be concluded that using speech commands to select words and 
apply formatting takes significantly longer than using the mouse or keyboard.  
The second null hypothesis of no difference could also be rejected at a significance level of 0.05. Post-hoc tests 
were conducted to determine which sessions differed significantly from one another. Session 2 differed 
significantly from sessions 4 to 10, session 3 differed significantly from sessions 7 to 10 and session 4 differed 
significantly from session 8. The extended completion times could be attributed to the learning curve 
experienced with the verbal commands. Similar to previous tasks, session 2 has the highest time which has 
previously been attributed to inexperience with the application. This may be true for this task as well. There 
was a large improvement in task completion during session 3 but this was not a significant improvement. The 
keyboard also showed marked improvement after the first two sessions.  
This task can be considered more complex than the previous since it requires slightly more difficult verbal 
commands to be issued, in sequence, in order to achieve the desired goal. Selection of a word is accomplished 
through the verbal command of “select word” or “select word back” or alternatively, individual characters can 
be selected by issuing several “shift right” or “shift left” commands. These commands may be less intuitive
than the previous commands and may take longer to become accustomed to and to remember. Even so,
after a number of sessions, participants were able to achieve times that were comparable with the mouse
and/or keyboard. The best performance with the speech commands was achieved during the eighth session
after which the time increased again. It is also possible that participants struggled to
remember that there were commands available to select a single word at a time and resorted to selecting the
letters one at a time. Closer inspection of the number of actions and commands issued will provide more
insight into whether this was the reason for the extended time.
 
6.7.3.2 Number of actions 
Table 6.13 shows the number of participants included in each session for the analysis, followed by the mean 
and finally the standard deviation for each session. 
  
Table 6.13: Descriptive statistics for the number of actions required to format selected words
                All participants    Participants completing all sessions
                Speech    Keyboard  Speech    Keyboard
Session 2   n   25        22        14        11
            x̄   15.1      14.9      16.1      17.1
            s   10.7      14.0      12.4      16.0
Session 3   n   23        23        14        11
            x̄   13.6      15.3      14.1      11.7
            s   10.0      16.8      10.6      10.3
Session 4   n   24        24        14        11
            x̄   10.4      14.2      8.6       13.5
            s   6.2       11.8      6.2       10.1
Session 5   n   23        22        14        11
            x̄   9.0       16.8      7.3       17.3
            s   6.1       14.6      4.2       18.7
Session 6   n   24        24        14        11
            x̄   9.3       13.1      9.6       15.6
            s   5.8       14.2      5.8       15.1
Session 7   n   23        21        14        11
            x̄   8.7       19.3      7.7       19.1
            s   6.1       20.1      5.2       20.1
Session 8   n   20        19        14        11
            x̄   8.1       9.5       7.4       8.4
            s   5.2       7.8       5.5       5.1
Session 9   n   22        22        14        11
            x̄   9.7       14.0      7.0       11.4
            s   6.8       12.0      3.6       9.3
Session 10  n   24        24        14        11
            x̄   10.5      14.2      8.8       12.6
            s   6.1       11.8      5.5       11.7
 
 
Chart 6.8: Mean plot for the number of actions required to format selected words 
According to the table above, the number of actions required to complete the task using the keyboard or 
mouse was, on average, much higher than the number required when using the speech commands after the 
third session. This is contrary to one of the explanations offered for the difference in the times required to 
complete the tasks. Chart 6.8 gives a visual representation of the mean number of actions for the interaction 
techniques over all the sessions. 
The assumption of sphericity was not met (χ²(35) = 93.477, p < 0.05); therefore the adjusted corrections will
also be reported. The following hypotheses were investigated:
1. H0,1: The interaction technique does not have a significant impact on the number of actions required 
to complete the task. 
2. H0,2: There is no significant difference between the number of actions per session. 
The following table summarises all the results for the repeated-measures within-subjects ANOVA. 
Table 6.14: Analysis results for the number of actions required to format selected words
                                 ANOVA                         Geisser-Greenhouse              Huynh-Feldt                      Multivariate
Interaction technique            F(1, 23) = 2.598, p > 0.05
Session                          F(8, 184) = 2.234, p < 0.05   F(4.1, 94.3) = 2.234, p < 0.05  F(5.3, 122.2) = 2.234, p < 0.05  F(8, 16) = 3.300, p < 0.05
Interaction technique × Session  F(8, 184) = 1.646, p > 0.05   F(4.1, 94.3) = 1.646, p > 0.05  F(5.3, 122.2) = 1.646, p > 0.05  F(8, 16) = 1.546, p > 0.05
 
From the table it can be concluded that H0,1 could not be rejected but H0,2 could be rejected. This means that 
while the interaction technique does not have an impact on the number of actions required to complete the 
task, the session does. Post-hoc tests indicate that session 2 differed significantly from sessions 8 and 9. Even 
though there was some learning experienced, as indicated by the decrease in the number of actions for all 
sessions, the improvement was not significant from one session to the next. However, the level of 
improvement resulted in the performance in sessions 8 and 9 being significantly better than for the second 
session. 
Since the speech interaction technique took significantly longer to complete the tasks, it was assumed that this 
could mean that more actions were required to complete the task. However, this analysis shows that there is 
no significant difference between the actions performed during the task even though the number of actions 
for the speech was, on average, less than that for the keyboard. It seems counterintuitive that two measures 
of efficiency could yield such contradictory results. The question now arises – how can the speech be 
significantly slower but require fewer actions, albeit not significantly fewer? A reasonable explanation for this 
could be that there are longer pauses between the actions performed or that the actions performed required 
more time to complete. The following section will investigate this supposition. 
 
6.7.3.3 Average time between actions 
The fact that speech commands resulted in a significantly longer time to complete the task but required fewer
actions may seem contradictory. It was, however, inferred that although fewer actions were
required, each action required more time than those for the keyboard. Therefore, in an effort to
explain the apparent discrepancy, the average time between actions was measured for both of these tasks.
This time does not include the time from when the task started to when the first action is performed but 
rather measures only the difference between performed actions. The mean difference between actions and 
the standard deviation per session can be seen in Table 6.15. 
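The measure described above can be sketched as follows. The function name and the timestamps are hypothetical, and returning no value for a task with fewer than two actions is an assumption, since the thesis does not state how such cases were treated.

```python
def mean_time_between_actions(action_times):
    """Mean gap in seconds between consecutive actions of one task.

    The time from the start of the task to the first action is deliberately
    excluded: only differences between performed actions are averaged.
    """
    if len(action_times) < 2:
        return None  # no gap can be measured (assumed handling)
    ordered = sorted(action_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical task: it starts at 0.0 s and the actions occur at these times.
action_times = [4.0, 6.5, 9.0, 12.0]
average_gap = mean_time_between_actions(action_times)  # gaps: 2.5, 2.5 and 3.0 s
```

In this example the 4.0 seconds that elapse before the first action do not contribute to the average, so the measure reflects only the pace at which successive actions follow one another.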
Table 6.15: Descriptive statistics for the time difference between actions
                All participants    Participants completing all sessions
                Speech    Keyboard  Speech    Keyboard
Session 2   x̄   3.364     1.466     3.119     1.405
            s   1.437     1.350     0.811     1.320
Session 3   x̄   3.215     1.549     2.972     1.418
            s   1.411     1.271     1.199     1.226
Session 4   x̄   2.640     1.195     2.586     1.132
            s   0.747     0.986     0.657     1.196
Session 5   x̄   2.672     1.608     2.457     1.741
            s   1.091     0.815     0.951     2.244
Session 6   x̄   2.480     1.323     2.136     1.189
            s   0.788     1.072     0.369     1.164
Session 7   x̄   2.233     1.452     2.176     1.060
            s   0.674     1.434     0.793     1.461
Session 8   x̄   2.371     1.085     2.398     0.833
            s   0.633     0.774     0.745     0.519
Session 9   x̄   2.372     1.214     2.350     0.918
            s   0.624     1.165     0.567     0.896
Session 10  x̄   2.418     1.356     2.206     1.430
            s   0.749     1.522     0.535     1.935
 
Table 6.15 and Chart 6.9 clearly show that the average time between actions is considerably shorter for the
keyboard and mouse than for the speech commands. It is encouraging to note that the time between
commands for the speech generally decreased over the earlier sessions, although it appears to stabilise from the
seventh session. The following hypotheses were formulated:
1. H0,1: The interaction technique has no noticeable impact on the average time between actions. 
2. H0,2: The average time between commands does not differ noticeably between sessions.
 
 
Chart 6.9: Mean plot for the time difference between actions 
 
Using a confidence level of 95%, both H0,1 (F(1, 25) = 22.307, p < 0.05) and H0,2 (F(8, 200) = 2.037, p < 0.05)
could be rejected (Table 6.16 contains the adjusted corrections and results of the multivariate analysis since
the assumption of sphericity did not hold: χ²(35) = 76.009, p < 0.05). This means that the interaction
technique has a significant effect on the time elapsed between actions. Actions can be performed in much
more rapid succession when using the keyboard and mouse than when using speech commands.
Table 6.16: Analysis results for the time difference between actions
                                 Geisser-Greenhouse               Huynh-Feldt                      Multivariate
Session                          F(4.6, 116.1) = 2.037, p < 0.05  F(6.1, 151.6) = 2.037, p < 0.05  F(8, 18) = 1.972, p > 0.05
Interaction technique × Session  F(4.6, 116.1) = 1.035, p > 0.05  F(6.1, 151.6) = 1.035, p > 0.05  F(8, 18) = 1.972, p > 0.05
 
This analysis shows that even though the number of actions required to complete the task is comparable for 
the two interaction techniques, the time difference between issuing commands is not negligible and has a 
noticeable impact on the time taken to complete the task. The improvement between the second and third 
sessions and the last four sessions was significant. 
The time difference between the issuing of commands could be attributed to two reasons, namely either the 
physical utterance of the verbal command consumes more time than key presses or quite possibly the 
participants required more time to determine the next command to be issued for the speech than for the 
keyboard. The fact that the time between commands decreased over time points to the second reason,
as it seems implausible that the participants could learn to utter a command faster, since a person’s speaking rate
is an innate human quality. Therefore, it could be assumed that the commands used were less intuitive than
the prior commands required and could have placed more strain on the memory of the participant. Since the 
time between commands appears to stabilise from session 7 onwards, this could be the first session where the 
time difference is purely because of the time required to issue the command and not recall the command. A 
more prolonged testing period may serve to either substantiate or contradict this statement and further 
research is required to test this statistically. 
6.7.3.4 Correctness of task completion 
This task too had a maximum score of three which could be obtained by completing the following steps 
correctly: 
1. Any portion of text is selected by the participant. 
2. The correct portion of text is selected by the participant. 
3. Bold formatting is applied to the selection. 
The chart below is a stacked bar graph representing the number of participants who scored zero, one, two or 
three for the task. Participants with zero were unable to complete any steps of the tasks correctly. The graph 
shows the results for both interaction techniques as well as all sessions. 
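The scoring and the tallies behind such a stacked bar graph can be sketched as follows. The step names and the example attempts are hypothetical, and it is assumed that the score is simply the number of steps completed correctly.

```python
from collections import Counter

# The three steps listed above; the identifiers are illustrative only.
STEPS = ("text_selected", "correct_text_selected", "bold_applied")

def score_attempt(attempt):
    """Score one attempt from 0 to 3: one point per step completed correctly."""
    return sum(1 for step in STEPS if attempt.get(step, False))

def tally_scores(attempts):
    """Count how many participants obtained each score per (session, technique)."""
    tallies = {}
    for session, technique, attempt in attempts:
        tallies.setdefault((session, technique), Counter())[score_attempt(attempt)] += 1
    return tallies

# Hypothetical attempts from session 2.
all_correct = {"text_selected": True, "correct_text_selected": True, "bold_applied": True}
wrong_words = {"text_selected": True, "correct_text_selected": False, "bold_applied": True}
attempts = [
    (2, "speech", all_correct),
    (2, "speech", wrong_words),
    (2, "keyboard", all_correct),
    (2, "keyboard", {}),  # no step completed correctly
]
tallies = tally_scores(attempts)
```

Each Counter corresponds to one bar of the chart, with one segment per score.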
[Stacked bar chart: number of participants (vertical axis) with a score of 0, 1, 2 or 3, shown for the speech and keyboard tasks in each of sessions 2 to 10]
 
Chart 6.10: Correctness of task - Select words and apply formatting 
 
Similar to the previous tasks, the vast majority of participants could complete the task correctly from the
very first session, irrespective of the interaction technique which was used. There is a slightly higher
occurrence of participants who did not complete the task entirely correctly, but once again this is mainly due
to the fact that the task was not read correctly. Participants did not always select the first two words on the
current line, thereby causing a decrease in the correctness with which they completed the task. Hence, if the
selection of the first two words were to be disregarded then virtually all tasks would be completed with 100%
correctness.
 
6.7.4 Paste 
These tasks required the participants to paste a previously copied or cut word after the second word on the 
current line that they were on. Therefore, there was some navigation required, followed by a paste. Both of 
the tasks could be achieved in the same minimum number of actions. 
 
6.7.4.1 Time to complete the task 
The number of participants whose data was included in the analysis, the mean in seconds as well as the 
standard deviation of each session and for each interaction technique are summarised in Table 6.17 and the 
chart directly below that plots the means for the data. 
  
  
Table 6.17: Descriptive statistics for paste time completion
                All participants    Participants completing all sessions
                Speech    Keyboard  Speech    Keyboard
Session 2   n   23        23        13        13
            x̄   8.6       13.6      8.2       11.9
            s   2.9       4.4       2.2       3.1
Session 3   n   23        23        13        13
            x̄   9.1       10.6      7.2       9.3
            s   7.0       4.2       2.4       1.8
Session 4   n   24        24        13        13
            x̄   6.2       10.2      5.6       8.4
            s   2.0       4.7       1.6       3.6
Session 5   n   23        23        13        13
            x̄   5.6       8.0       5.6       6.7
            s   1.4       3.0       1.9       1.5
Session 6   n   23        23        13        13
            x̄   5.0       7.8       4.7       7.1
            s   1.0       2.1       0.9       1.9
Session 7   n   22        22        13        13
            x̄   4.8       8.3       4.3       7.0
            s   1.5       3.8       1.1       2.2
Session 8   n   20        20        13        13
            x̄   4.6       6.9       4.3       6.0
            s   1.1       2.8       1.2       1.5
Session 9   n   22        22        13        13
            x̄   4.7       8.7       4.5       8.6
            s   1.7       5.1       0.9       6.5
Session 10  n   24        24        13        13
            x̄   4.4       9.3       4.2       6.7
            s   1.3       7.7       1.1       2.0
 
Using Table 6.17 and Chart 6.11 as a reference, it can be seen that, on average, the time to complete the task
using speech commands was shorter than when using the keyboard or mouse. The assumption of sphericity
(χ²(35) = 37.242, p > 0.05) was met at a confidence level of 95%, therefore no adjusted corrections had to
be applied. The following hypotheses were formulated:
1. H0,1: The interaction technique has no effect on the time required to complete the task. 
2. H0,2: The session in which the task was completed has no effect on the time required to complete the 
task. 
When a repeated-measures within-subjects ANOVA was performed, it was found that there was a significant
interaction between the two factors of session and interaction technique (F(8, 192) = 2.356, p < 0.05).
Therefore, it was imperative that each factor be analysed in isolation to preclude the interaction with the 
other factor having an effect on the analysis.  
 
 
Chart 6.11: Mean plot for the paste time completion 
Firstly, H0,1 was evaluated by isolating each session individually and testing for a difference between 
interaction techniques. For brevity’s sake, the actual results of the ANOVA will not be reported here. Suffice it 
to say that, at an α-level of 0.05, there was a significant difference between the interaction techniques in every 
session. Therefore, the completion time is significantly better for speech than for the keyboard and mouse 
throughout all the sessions. 
Secondly, H0,2 was evaluated using a repeated-measures within-subjects ANOVA, testing each interaction
technique separately. Consequently, it was found that H0,2 could be rejected for both the speech interaction 
technique (F(8, 96) = 17.727, p < 0.05) and the keyboard and mouse (F(8, 96) = 6.883, p < 0.05). For the speech 
interaction technique, post-hoc tests indicated that there was a significant difference between the times of 
session 2 and sessions 4 to 10, as well as between session 3 and sessions 6 to 10 and between session 4 and 
sessions 7, 8 and 10. Similarly, there was a significant difference between session 5 and sessions 7, 8, and 10.
These results indicate that session 2 could be viewed as a simple practice run to allow participants to become
accustomed to the application and the appropriate use of the speech commands. From then onwards there
was improvement in the times achieved to complete the task, to such an extent that from session 4 onwards
completion times were significantly better than in session 2. Over the subsequent three sessions (sessions 3 to 5)
there was continued improvement, but these sessions were still significantly slower than the final sessions. From
session 6 onwards, there was still minor improvement but not to an extent that times differed significantly.
Post-hoc tests for the keyboard and mouse showed that session 2 differed significantly from sessions 4 to 10 and
session 3 differed significantly from sessions 8 and 10.
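Testing the effect of session for one interaction technique at a time amounts to a one-way repeated-measures ANOVA on a participants × sessions table. The sketch below shows how the F statistic and its degrees of freedom arise; the function name and the data are hypothetical, and it is not the statistical package used in the study. With 13 participants and 9 sessions the degrees of freedom are 8 and 96, which agrees with the values reported above for each technique.

```python
def repeated_measures_f(data):
    """One-way repeated-measures ANOVA for one row per participant, one value per session.

    Returns (F, df_sessions, df_error). Converting F to a p-value requires an
    F distribution, which the Python standard library does not provide.
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    session_means = [sum(row[j] for row in data) / n for j in range(k)]
    subject_means = [sum(row) / k for row in data]
    # Variation between sessions.
    ss_sessions = n * sum((m - grand) ** 2 for m in session_means)
    # Residual variation after removing session and participant effects.
    ss_error = sum((data[i][j] - session_means[j] - subject_means[i] + grand) ** 2
                   for i in range(n) for j in range(k))
    df_sessions, df_error = k - 1, (k - 1) * (n - 1)
    f_value = (ss_sessions / df_sessions) / (ss_error / df_error)
    return f_value, df_sessions, df_error

# Hypothetical completion times (seconds): 3 participants x 3 sessions.
times = [
    [10, 8, 6],
    [12, 9, 7],
    [14, 10, 8],
]
f_value, df1, df2 = repeated_measures_f(times)
```

Because each participant serves as their own control, differences between participants are removed from the error term, which is what distinguishes this design from a between-subjects ANOVA.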
 
6.7.4.2 Number of actions 
Descriptive statistics for the number of actions are given in Table 6.18. The number of participants whose data
was included in the analysis, after the removal of the outliers, is shown in the first line of each row in Table
6.18, followed by the mean for that session in the second line and the standard deviation in the third.
  
Table 6.18: Descriptive statistics for the number of actions to complete a paste
                All participants    Participants completing all sessions
                Speech    Keyboard  Speech    Keyboard
Session 2   n   22        18        12        5
            x̄   1.9       2.4       1.8       2.0
            s   0.8       0.8       0.8       0
Session 3   n   22        20        12        5
            x̄   2.9       2.3       2.8       2.2
            s   1.5       0.6       1.5       0.4
Session 4   n   24        23        12        5
            x̄   1.8       2.4       1.5       2.0
            s   1.0       0.7       0.7       0
Session 5   n   23        22        12        5
            x̄   1.6       2.4       1.8       2.2
            s   1.0       0.9       1.4       0.4
Session 6   n   22        21        12        5
            x̄   1.5       2.1       1.4       2.0
            s   0.9       0.3       0.7       0
Session 7   n   22        19        12        5
            x̄   1.7       2.6       1.9       2.2
            s   0.9       0.8       1.2       0.4
Session 8   n   20        19        12        5
            x̄   1.6       2.5       1.4       2.2
            s   0.8       0.9       0.8       0.4
Session 9   n   22        21        12        5
            x̄   1.7       2.3       1.4       2.2
            s   0.9       0.6       0.7       0.4
Session 10  n   24        20        12        5
            x̄   1.6       2.3       1.5       2.8
            s   1.0       0.6       0.7       0.8
 
Chart 6.12 gives a visual presentation of the mean number of actions per interaction technique for each 
session. 
 
Chart 6.12: Mean plot for the number of actions to complete the paste 
 
The minimum number of actions to complete the tasks was 1 for speech (“Paste”) and 2 when using the 
keyboard ([Control + V]) and mouse; therefore if the participants used the most efficient method for each task 
the number of actions should be approximately the same. From Chart 6.12 it can be seen that the number of
actions when using the keyboard and mouse is consistently higher than when using the speech commands for 
all sessions except for session 3. The mean for the speech indicates that the majority of the participants could 
use the most effective means when using speech commands. However, the same cannot be said for the 
keyboard and mouse. 
The following hypotheses were formulated: 
1. H0,1: The interaction technique does not have a significant impact on the number of actions required 
to complete the task. 
2. H0,2: There is no significant difference between the number of actions per session. 
The assumption of sphericity (χ²(35) = 66.827, p < 0.05) was not met, therefore an adjusted correction analysis
was required (Table 6.19). The repeated-measures within-subjects ANOVA allowed for H0,1 to be rejected
(F(1, 15) = 6.287, p < 0.05) at an α-level of 0.05. Since the speech commands required fewer actions in all sessions
(apart from session 3), it can be concluded that the keyboard and mouse required significantly more 
commands to complete the task than the speech. One observation that was made during data capturing was 
that many participants used a right click to show the context menu and then clicked on paste. The paste 
command is normally the third item on the menu. However, if there is a spelling or grammatical error, the 
paste command moves to the last item on the menu. Very often it was the case that where the paste was to 
occur there was an error in the document. This led to the participants not seeing the paste option at the very
end of the menu as they were not accustomed to this. The participants would then repeatedly right click the
menu in an attempt to get a menu that they recognised, not realising that the paste command was in fact
available. This could have significantly increased the number of actions performed by the participants when 
using the mouse. The fact that this behaviour was observed throughout the nine sessions also indicates that 
the participants did not learn that the paste option shifts to the end of the menu to accommodate corrective 
suggestions to the text. 
The second null hypothesis could not be rejected (F(8, 120) = 1.297, p > 0.05) at an α-level of 0.05. Given that 
the number of actions for the speech was low throughout, this indicates that no learning was required for the 
paste command as it was intuitive enough to accommodate expedited completion times from the very first use 
of the command. Furthermore, it can also be said that no learning occurred when using the keyboard and 
mouse to paste a piece of text which is slightly more worrisome as it would be expected at the competency 
level of the participants that such a minor change in the menu arrangement would be easily noticeable. 
However, since it is doubtful that a paste will occur before the spelling is corrected under normal use, this
observed phenomenon is perhaps inconsequential within the scope of standard word processing use. 
Table 6.19: Analysis results for the number of actions to complete the paste task
                                 Geisser-Greenhouse              Huynh-Feldt                     Multivariate
Session                          F(4.0, 59.3) = 1.297, p > 0.05  F(5.9, 88.5) = 1.297, p > 0.05  F(8, 8) = 2.011, p > 0.05
Interaction technique × Session  F(4.0, 59.3) = 1.424, p > 0.05  F(5.9, 88.5) = 1.424, p > 0.05  F(8, 8) = 0.948, p > 0.05
 
6.7.4.3 Correctness of task completion 
For this task, participants could only score zero or one as the paste had to occur at the precise location of the 
cursor when the task started. Therefore, no positioning was required to complete the task. All participants 
could complete the tasks correctly from the very first session and with either interaction technique. It was only
in sessions 2 and 3 of the keyboard task that a single participant did not perform the paste correctly.
 
6.7.5 Undo 
The undo tasks required only a single action when using both the speech and the keyboard or mouse and were 
designed to undo the paste of the previous task. The same analysis procedure as for the prior tasks was 
followed for the undo tasks. 
 
6.7.5.1 Time to complete 
The table below gives descriptive statistics for the completion time of the undo task. The number of
observations, mean and standard deviation are listed on the first, second and third row for each session. The 
chart directly following that plots the mean completion times. 
 
 Table 6.20: Descriptive statistics for task completion time for the undo task 
                  All participants      Participants completing all sessions
                  Speech    Keyboard    Speech    Keyboard
Session 2    n    23        22          13        13
             x̄    11.7      10.6        8.7       9.4
             s    7.3       4.5         3.8       3.8
Session 3    n    23        23          13        13
             x̄    9.4       8.7         7.5       7.4
             s    8.4       5.1         4.7       3.9
Session 4    n    24        24          13        13
             x̄    7.7       7.4         5.8       6.3
             s    4.8       4.1         1.6       3.1
Session 5    n    23        23          13        13
             x̄    5.8       6.6         5.6       6.0
             s    2.6       3.1         3.1       2.6
Session 6    n    23        23          13        13
             x̄    5.3       5.9         4.7       5.1
             s    1.7       2.1         1.1       2.0
Session 7    n    22        22          13        13
             x̄    4.6       5.8         4.2       5.3
             s    1.0       2.6         0.7       2.4
Session 8    n    20        20          13        13
             x̄    4.3       5.3         4.1       4.5
             s    1.1       2.6         0.7       2.1
Session 9    n    22        22          13        13
             x̄    4.5       5.6         4.2       5.1
             s    1.1       2.6         1.0       2.8
Session 10   n    24        24          13        13
             x̄    4.6       4.9         4.4       4.3
             s    1.1       1.9         1.0       1.9
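
The three values reported per session (n, x̄ and s) can be reproduced from raw timings with a few lines of code. The following is a minimal sketch only: the timings are hypothetical, and it assumes that s is the sample standard deviation (n - 1 in the denominator), which the text does not state explicitly.

```python
from statistics import mean, stdev


def describe(times):
    """Return (n, mean, sample standard deviation), rounded to one decimal."""
    return len(times), round(mean(times), 1), round(stdev(times), 1)


# Hypothetical completion times (seconds) for one session and one technique.
speech_times = [4.0, 5.5, 3.5, 6.0, 4.5]

n, x_bar, s = describe(speech_times)
print(f"n = {n}, mean = {x_bar}, s = {s}")
```
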
 
 
Chart 6.13: Mean plot for the completion time of the undo task 
 
This is the first task for which the speech and keyboard interaction techniques have almost identical completion 
times. Both interaction techniques exhibited a decrease in completion time as the sessions progressed. Only 
sessions 7 and 9 show a slight increase for the keyboard, and only sessions 9 and 10 for the speech. However, 
these increases are expected to be non-significant. Furthermore, since the completion times over all sessions 
are approximately the same, it is not expected that there will be a significant difference between the 
interaction techniques. However, it is still essential that statistical analysis be performed to verify this as 
well as to establish whether the decrease in completion times, as exhibited by both interaction techniques, is 
significant. 
The assumption of sphericity for the time data was violated (χ²(35) = 50.614, p < 0.05), therefore the adjusted 
corrections will also be reported. The following hypotheses were formulated for evaluation of the undo tasks: 
1. H0,1: The interaction technique does not significantly affect the time taken to complete the task. 
2. H0,2: The time taken to complete the task does not differ significantly over the sessions. 
As expected, H0,1 could not be rejected at an α-level of 0.05, which indicates that the task can be completed in 
comparable times regardless of the interaction technique. The second null hypothesis could, however, be 
rejected at an α-level of 0.05 (Table 6.21). Therefore, the sessions differed significantly; in particular, each of 
sessions 2 to 5 differed significantly from all sessions that followed them. This means that, even though the 
improvement between the first sessions was not significant, it eventually allowed the last sessions to be 
significantly better than the first.  
Table 6.21: Analysis results for the completion time of the undo task 
                                  ANOVA                 Geisser-Greenhouse        Huynh-Feldt               Multivariate
Interaction technique             F(1, 24) = 0.001,
                                  p > 0.05
Session                           F(8, 192) = 22.148,   F(5.5, 131.5) = 22.148,   F(7.6, 182.0) = 22.148,   F(8, 17) = 20.036,
                                  p < 0.05              p < 0.05                  p < 0.05                  p < 0.05
Interaction technique × Session   F(8, 192) = 0.643,    F(5.5, 131.5) = 0.643,    F(7.6, 182.0) = 0.643,    F(8, 17) = 0.784,
                                  p > 0.05              p > 0.05                  p > 0.05                  p > 0.05
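
The fractional degrees of freedom in the corrected columns arise because both the effect and the error degrees of freedom are multiplied by a sphericity estimate ε, while the F value itself stays unchanged. The sketch below illustrates this for the session effect; the ε value of 0.685 is an assumption inferred from the rounded degrees of freedom in the table and is not reported in the text.

```python
def corrected_df(df_effect, df_error, epsilon):
    """Scale both degrees of freedom by the sphericity estimate epsilon."""
    return epsilon * df_effect, epsilon * df_error


def implied_epsilon(df_effect, corrected_effect_df):
    """Recover the (rounded) epsilon implied by a reported corrected df."""
    return corrected_effect_df / df_effect


# Uncorrected session effect: F(8, 192). Epsilon below is an inferred,
# illustrative value, not a figure taken from the source.
df1, df2 = corrected_df(8, 192, 0.685)
print(f"F({df1:.1f}, {df2:.1f})")

# The reported corrected df of 5.5 implies an epsilon of roughly 0.69.
print(round(implied_epsilon(8, 5.5), 2))
```

An ε close to 1 (as with the Huynh-Feldt column here) indicates a mild departure from sphericity, whereas smaller values shrink the degrees of freedom more strongly.
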
 
6.7.5.2 Number of actions 
As previously mentioned, the tasks respectively required 1 or 2 actions to be completed with the speech and 
keyboard interaction techniques. Descriptive statistics for the number of actions for this task are contained in 
Table 6.22. 
 Table 6.22: Descriptive statistics for the number of actions to complete the undo task 
                  All participants      Participants completing all sessions
                  Speech    Keyboard    Speech    Keyboard
Session 2    n    23        22          13        13
             x̄    2.3       2.6         1.8       2.5
             s    1.3       1.9         0.9       1.5
Session 3    n    23        23          13        13
             x̄    2.5       3.5         2.8       2.6
             s    2.4       3.5         2.3       2.3
Session 4    n    24        24          13        13
             x̄    1.8       2.6         1.7       1.8
             s    0.6       2.3         0.9       0.4
Session 5    n    23        22          13        13
             x̄    1.8       1.9         1.9       1.9
             s    0.5       1.1         1.3       0.5
Session 6    n    23        23          13        13
             x̄    1.9       1.7         1.6       2.0
             s    0.7       0.7         0.5       0.6
Session 7    n    22        22          13        13
             x̄    1.8       1.3         1.2       1.9
             s    0.6       0.5         0.4       0.5
Session 8    n    19        20          13        13
             x̄    1.9       1.5         1.5       1.9
             s    0.5       0.6         0.7       0.5
Session 9    n    22        22          13        13
             x̄    1.8       1.8         1.9       1.8
             s    0.4       0.7         0.8       0.4
Session 10   n    24        24          13        13
             x̄    1.9       1.8         1.6       2.1
             s    0.7       0.9         0.8       0.6
 
From the table above, it can be inferred that the number of actions for the two interaction techniques is 
approximately the same for all sessions. There is no real discernible pattern in the number of actions. The 
number of actions for the speech increases for session 3 and decreases sharply for session 4, after which it 
stabilises. Conversely, the number of actions for the keyboard and mouse starts decreasing from session 4 and 
continues decreasing until session 7. Chart 6.14 provides a plot of the mean number of actions to assist the 
analysis for significant differences. 
The following hypotheses were formulated for evaluation of the number of actions: 
1. H0,1: The interaction technique has no effect on the number of actions required to complete the task. 
2. H0,2: There is no difference between the number of actions required to complete the task between 
the sessions. 
 
 
Chart 6.14: Mean number of actions to complete the undo task 
 
The assumption of sphericity required for the repeated-measures within-subjects ANOVA was violated (χ²(35) 
= 137.438, p < 0.05), therefore the tabulated results below show the results of the ANOVA, the multivariate 
tests and the adjusted corrections. 
Table 6.23: Analysis results for the number of actions to complete the undo task 
                                  ANOVA                 Geisser-Greenhouse        Huynh-Feldt               Multivariate
Interaction technique             F(1, 24) = 2.294,
                                  p > 0.05
Session                           F(8, 192) = 2.934,    F(2.5, 62.4) = 2.934,     F(3.1, 73.5) = 2.934,     F(8, 17) = 3.742,
                                  p < 0.05              p < 0.05                  p < 0.05                  p < 0.05
Interaction technique × Session   F(8, 192) = 0.690,    F(2.5, 62.4) = 0.690,     F(3.1, 73.5) = 0.690,     F(8, 17) = 2.904,
                                  p > 0.05              p > 0.05                  p > 0.05                  p < 0.05
 
From Table 6.23 it can be concluded that H0,1 could not be rejected, meaning that there was no significant 
difference between the number of actions required to complete the task when using speech and when using the 
keyboard and mouse. The second hypothesis of no difference could, however, be rejected using a confidence 
level of 95%. Therefore, there is a significant difference between the number of actions over the sessions. 
Post-hoc tests provided more insight into which sessions differed significantly. It was only session 3 which 
differed significantly from other sessions, namely sessions 4, 6, 7 and 8. Session 3 had a very high average 
number of actions hence it could be concluded that during session 3 significantly more actions were 
performed to complete the task than in sessions 4, 6, 7 and 8. Since this was an isolated incident the overall 
conclusion that could be made from these findings is that to reverse the previous action is as simple when 
using speech as when using the keyboard and mouse, even to such an extent that from the very first session 
the number of actions between the two interaction techniques is on a comparable level. 
 
6.7.5.3 Correctness of task completion 
The simple nature of the task required only a single action and, apart from the final session of the 
keyboard task, all participants completed the task correctly. In the final session of the keyboard task, one 
participant did not complete the task correctly because, instead of using the keyboard, the participant issued 
the relevant speech command. Technically this means the task was completed correctly, but the participant was 
penalised since the incorrect interaction technique was used. 
 
6.7.6 Select word and copy 
At the start of this task, for both the interaction techniques, the cursor should have been at the end of a line in 
the document, based on the task directly prior to this one. This meant that the task required the participant to 
select the word directly to the left of the cursor. Therefore, this task tested a different type of selection to the 
previous tasks which had a selection component. 
 
6.7.6.1 Time to complete task 
The table below contains the number of observations included in the analysis, the mean of the 
observations and finally the standard deviation. These are arranged in the first, second and third lines 
respectively of each row. 
 Table 6.24: Descriptive statistics for the completion time for selecting and copying a word 
                  All participants      Participants completing all sessions
                  Speech    Keyboard    Speech    Keyboard
Session 2    n    21        21          13        11
             x̄    37.6      17.6        29.3      15.6
             s    18.1      6.0         11.2      5.0
Session 3    n    22        21          13        11
             x̄    31.5      19.0        31.2      17.9
             s    19.4      8.7         20.4      9.1
Session 4    n    24        24          13        11
             x̄    29.6      23.6        24.8      16.1
             s    22.5      12.9        23.4      7.2
Session 5    n    22        22          13        11
             x̄    23.1      15.3        25.1      13.6
             s    14.1      7.4         17.3      9.0
Session 6    n    22        22          13        11
             x̄    15.3      14.5        12.7      11.9
             s    6.7       5.5         5.2       3.4
Session 7    n    22        22          13        11
             x̄    23.9      17.9        23.2      14.6
             s    17.2      8.2         15.5      6.3
Session 8    n    20        20          13        11
             x̄    19.9      17.9        19.3      15.3
             s    14.9      7.0         17.6      6.8
Session 9    n    22        22          13        11
             x̄    20.2      15.9        19.8      13.1
             s    14.8      6.1         16.0      5.9
Session 10   n    24        24          13        11
             x̄    20.7      17.0        16.5      13.9
             s    10.9      8.8         5.1       6.1
 
Through inspection of Table 6.24, it can be inferred that the speech interaction technique required more time 
to complete the task for all of the sessions. There are isolated sessions, for example session 6, where the 
completion times appear to be on a more comparable level between the two interaction techniques. Chart 
6.15 is a plot of the mean completion times for both interaction techniques over all sessions. 
 
Chart 6.15: Mean plot for the completion time for selecting and copying a word 
 
A repeated-measures within-subject ANOVA will be used to determine whether these differences are 
significant. The following hypotheses were formulated: 
1. H0,1: There is no difference in the time to complete the task when using the different interaction 
techniques. 
2. H0,2: There is no difference between the time to complete the task over the different sessions. 
The assumption of sphericity was met (χ²(35) = 39.456, p > 0.05) at an α-level of 0.05. Using a confidence level 
of 95%, H0,1 could not be rejected (F(1, 22) = 3.655, p > 0.05) but H0,2 could be rejected (F(8, 176) = 3.470, p < 
0.05). The multivariate tests confirmed that H0,2 could be rejected (F(8, 15) = 3.103, p < 0.05).  
These results show that the interaction technique does not significantly affect the time required to complete the task of 
selecting a word and copying it. Since there were multiple sessions, a post-hoc test was required to determine 
which sessions differed significantly. Tukey’s HSD test showed that session 2 differed significantly from 
sessions 6 and 8. Session 3 also differed significantly from session 6. Session 2 had a significantly longer 
completion time than the other two and session 3 had a significantly longer time than session 6. 
To conclude, it can be said that although the speech commands necessitate a longer time to select a word and 
copy it, this longer time is not significantly different to that of the keyboard and mouse. Therefore, with regard 
to selecting text to the left of the cursor, the same efficiency can be achieved with the two interaction 
techniques. 
During the previous selection task the keyboard was significantly faster when used to select a word and apply 
formatting. This task also required a word to be selected but, since in this instance the word was to the left of 
the cursor, it was surmised that the selection might be slightly more complicated. The additional components to 
the task, namely bold and 
copy were considered to be inconsequential as both are very common tasks and required the same number of 
actions. The complexity of the task was viewed to be the selection of the required text. The fact that there was 
no significant difference could possibly be attributed to two reasons. Firstly, it could be that the selection of a 
word to the left of the cursor provided more of a challenge with the keyboard and mouse than selecting to the 
right. Secondly, it could possibly be that the previous task jogged the memory of the participants enough that 
they could effectively recall the required command, so much so that they could select words at a comparable 
rate to that of the keyboard and mouse. Closer inspection of the mean times for both of these tasks indicated 
that the completion time was fairly similar for the keyboard but was lower for the speech for the second task. 
This would seem to imply that the second proposition is more plausible than the first. Therefore, repetition of 
similar tasks allows the commands for the latter tasks to be recalled and executed more easily than the first 
time the task is encountered in each session of use. However, bearing in mind that the previous task required 
selection of two words and this task the selection of only a single word, credence could be lent to the first 
supposition as well. It would seem that both suppositions have some merit. Of course, this has not been 
analysed statistically and is based purely on observation of the data spread and speculation about the findings 
of the analysis. 
 
6.7.6.2 Number of actions 
The minimum number of actions required for the completion of this task was, once again, similar for the two 
interaction techniques. Table 6.25 summarises the descriptive statistics for the number of actions for this task. 
 Table 6.25: Descriptive statistics for the number of actions to select and copy text 
                  All participants      Participants completing all sessions
                  Speech    Keyboard    Speech    Keyboard
Session 2    n    22        20          13        8
             x̄    14.0      17.0        8.4       13.0
             s    12.2      22.6        5.1       11.3
Session 3    n    23        21          13        8
             x̄    10.9      24.9        10.8      11.5
             s    7.5       32.2        8.0       11.5
Session 4    n    24        24          13        8
             x̄    10.0      38.6        8.7       22.0
             s    9.3       42.4        8.9       30.7
Session 5    n    23        20          13        8
             x̄    9.5       12.7        9.3       12.5
             s    9.5       13.3        7.6       14.7
Session 6    n    23        20          13        8
             x̄    6.1       7.5         4.8       6.6
             s    6.1       5.8         2.7       4.2
Session 7    n    22        21          13        8
             x̄    9.0       16.4        8.5       14.9
             s    9.0       17.9        7.2       18.1
Session 8    n    20        19          13        8
             x̄    7.3       20.9        7.5       16.8
             s    7.3       23.8        8.2       18.9
Session 9    n    22        20          13        8
             x̄    7.5       13.3        7.9       10.5
             s    7.5       11.9        7.7       10.1
Session 10   n    24        24          13        8
             x̄    7.9       19.4        6.6       19.6
             s    7.9       22.2        3.3       23.2
 
The number of actions for the speech interaction technique varies in the same range from session 4 onwards. 
Sessions 2 and 3 have a slightly higher number of actions and session 6 has the lowest mean number of actions 
for the speech interaction technique. The number of actions for the keyboard, on the other hand, fails to 
stabilise and continues rising and falling sharply throughout the sessions (see Chart 6.16 for the mean graph). 
Nevertheless, the keyboard task has, on average, more actions than the speech interaction technique 
for all sessions.  
 
 
Chart 6.16: Mean for the number of actions to select and copy text 
 
The hypotheses below were formulated to determine whether these differences were significant: 
1. H0,1: The interaction technique has no effect on the number of actions required to complete the task. 
2. H0,2: The number of actions does not differ between the sessions. 
The assumption of sphericity was not met by the spread of data for the number of actions (χ²(35) = 126.721, p 
< 0.05). Table 6.26 below contains the results of the required analyses to evaluate the afore-mentioned 
hypotheses. 
Table 6.26: Analysis results for the number of actions required to select and copy text 
                                  ANOVA                 Geisser-Greenhouse        Huynh-Feldt               Multivariate
Interaction technique             F(1, 19) = 3.498,
                                  p > 0.05
Session                           F(8, 152) = 1.378,    F(3.2, 60.4) = 1.378,     F(4.1, 77.8) = 1.378,     F(8, 12) = 1.801,
                                  p > 0.05              p > 0.05                  p > 0.05                  p > 0.05
Interaction technique × Session   F(8, 152) = 1.099,    F(3.2, 60.4) = 1.099,     F(4.1, 77.8) = 1.099,     F(8, 12) = 0.074,
                                  p > 0.05              p > 0.05                  p > 0.05                  p > 0.05
 
Neither H0,1 nor H0,2 could be rejected at an α-level of 0.05. Therefore, it could be concluded that neither the 
interaction technique nor the session affects the number of actions required to complete the task. Similar to a 
previous task, this task displays the same phenomenon that the time for the speech interaction technique is 
more than for the keyboard but that the actions are fewer. Nonetheless, the differences are not significant, so 
when selecting and copying a word, an equivalent efficiency is achieved. The fact that the number of actions 
for the keyboard is higher than that for the speech could again be attributed to the selection technique used. 
Similar to previous selection tasks, the selection with the keyboard is achieved by selecting the characters 
individually instead of using a more efficient means such as the combination of the [Control] and [Shift] keys or 
a mouse selection. However, since the difference is not significant, the selection method used is of little 
consequence to the efficiency of task completion in terms of the number of actions. 
An observation that was made during data capturing was that participants expected some sort of feedback 
when issuing the verbal copy command even though there is no such feedback with the counterpart action for 
the keyboard. This may well be due to the fact that the user will at least be sure that they had clicked the 
correct menu option or used the correct keyboard shortcut but they could not be sure that the speech 
command had been correctly interpreted by the speech engine. Therefore, it becomes imperative that for 
commands with no visible result there must be feedback of some sort so that the user can be reassured that 
the command has been executed. 
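
The recommendation above can be sketched as a command handler that returns an explicit confirmation whenever a command leaves no visible trace in the document. This is an illustrative sketch only: the `Editor` class, the command names and the message wording are hypothetical and are not taken from the application described in this study.

```python
from typing import Optional

# Hypothetical set of commands whose effect cannot be seen on screen.
NO_VISIBLE_RESULT = {"copy"}


class Editor:
    """Toy editor state used only to illustrate confirmation feedback."""

    def __init__(self, selection: str = "") -> None:
        self.selection = selection
        self.clipboard = ""
        self.bold = False

    def execute(self, command: str) -> Optional[str]:
        """Run a command; return a feedback message when one is needed."""
        if command == "copy":
            self.clipboard = self.selection
        elif command == "bold":
            self.bold = True  # the change is visible in the document itself
        else:
            return f'Command "{command}" was not recognised'
        if command in NO_VISIBLE_RESULT:
            return f'Command "{command}" executed'
        return None


editor = Editor(selection="word")
print(editor.execute("copy"))
print(editor.execute("bold"))
```
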
 
6.7.6.3 Correctness of task completion 
The steps required to complete this task were as follows: 
1. A portion of text must be selected. 
2. Specifically the last word on the current line must be selected. 
3. The selection must be copied to the clipboard. 
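
The three criteria above can be expressed as a simple scoring function. The sketch assumes that each satisfied criterion contributes one point, giving a score between 0 and 3; the names used are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SelectCopyOutcome:
    """Observed outcome of one attempt at the select word and copy task."""
    text_selected: bool
    last_word_selected: bool
    copied_to_clipboard: bool


def correctness_score(outcome: SelectCopyOutcome) -> int:
    """Assumed scoring: one point per criterion met (0 to 3)."""
    return sum([
        outcome.text_selected,
        outcome.last_word_selected,
        outcome.copied_to_clipboard,
    ])


# A participant who selected the wrong text but still copied it.
print(correctness_score(SelectCopyOutcome(True, False, True)))
```
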
Chart 6.17 below is a stacked bar graph which shows the number of participants in each score category for 
both interaction techniques and all sessions. 
 
[Stacked bar chart: number of participants (vertical axis) in each score category (0 to 3), shown for the 
speech and keyboard interaction techniques in sessions 2 to 10.] 

Chart 6.17: Correctness of task completion - Select word and copy 

Similar to previous tasks, the majority of the participants completed the task 100% correctly with either 
interaction technique and from the very first session. As with the previous selection tasks, the reason for the 
lower scores was usually that the participant selected text other than that specified in the task 
instruction. This happened for both interaction techniques and should not significantly affect the use of the 
interaction techniques. 
 
6.7.8 Position and Paste 
This task required that the previously copied word be pasted after the second word of the current line. 
Therefore, both tasks required that the cursor be correctly positioned and then the contents of the clipboard 
had to be inserted at that position. Again, the minimum number of actions required to complete the tasks 
was similar for the different interaction techniques. 
 
6.7.8.1 Time to complete the task 
The table below summarises the descriptive statistics for the completion time of the task. 
Table 6.27: Descriptive statistics for completion time to position cursor and paste text 
                  All participants      Participants completing all sessions
                  Speech    Keyboard    Speech    Keyboard
Session 2    n    21        22          11        13
             x̄    53.0      23.1        59.1      19.8
             s    40.7      10.1        47.6      8.7
Session 3    n    21        22          11        13
             x̄    28.1      17.8        22.9      14.7
             s    15.4      7.7         11.5      4.7
Session 4    n    23        24          11        13
             x̄    38.8      14.4        31.5      11.9
             s    31.6      8.8         20.0      4.0
Session 5    n    22        23          11        13
             x̄    25.8      14.7        26.0      13.4
             s    16.1      8.0         13.6      7.4
Session 6    n    22        23          11        13
             x̄    31.9      11.9        21.8      10.8
             s    18.1      3.6         11.7      3.1
Session 7    n    22        22          11        13
             x̄    29.4      14.3        21.1      14.0
             s    18.0      5.9         11.3      6.4
Session 8    n    20        20          11        13
             x̄    27.4      13.6        27.1      12.7
             s    15.0      5.8         17.3      6.5
Session 9    n    22        22          11        13
             x̄    25.9      12.2        21.2      12.2
             s    15.2      4.0         8.6       4.7
Session 10   n    24        24          11        13
             x̄    24.8      12.1        23.7      11.3
             s    17.5      4.3         16.2      4.1
 
Sessions 2 and 4 are the only sessions where the speech interaction technique has a completion time that does 
not appear to be comparable to that of the keyboard. Nevertheless, throughout all the sessions, the speech 
interaction technique had a higher average completion time than the keyboard and mouse. Similar to previous 
tasks, there is continual improvement in the completion rate for the speech interaction technique as exposure 
to the application is prolonged. Chart 6.18 provides a plot of the means for the two interaction techniques 
over all the sessions. 
 
 
Chart 6.18: Mean plot for completion time to position cursor and paste text 
 
The following hypotheses will be used to determine whether the difference is significant: 
1. H0,1: The interaction technique has no effect on the time taken to complete the task. 
2. H0,2: The time taken to complete the task does not differ significantly between the sessions. 
The assumption of sphericity was not met (χ²(35) = 71.833, p < 0.05), therefore Table 6.28 contains the results 
of the ANOVA and the multivariate tests as well as the adjusted corrections. 
Table 6.28: Analysis results for completion time to position cursor and paste text 
                                  ANOVA                 Geisser-Greenhouse        Huynh-Feldt               Multivariate
Interaction technique             F(1, 22) = 15.448,
                                  p < 0.05
Session                           F(8, 176) = 5.123,    F(4.0, 89.0) = 5.123,     F(5.3, 116.4) = 5.123,    F(8, 15) = 5.705,
                                  p < 0.05              p < 0.05                  p < 0.05                  p < 0.05
Interaction technique × Session   F(8, 176) = 0.936,    F(4.0, 89.0) = 0.936,     F(5.3, 116.4) = 0.936,    F(8, 15) = 1.986,
                                  p > 0.05              p > 0.05                  p > 0.05                  p > 0.05
 
From the table above it can be concluded that both H0,1 and H0,2 could be rejected. This leads to the 
conclusion that the interaction technique does significantly affect the time taken to complete the task; 
specifically, when using the keyboard and mouse, the task can be completed in a noticeably shorter time than 
when using the speech interaction technique. Tukey’s HSD post-hoc test showed that session 2 differed 
significantly from all other sessions, with session 2 requiring a longer time to complete the task than any 
other session. Once again, the fact that the first session where the test was completed required more time to 
complete can be attributed to the participants’ lack of experience with the application. 
 
6.7.8.2 Number of actions 
The task could be completed using the same minimum number of actions for both interaction techniques. 
However, closer inspection of Table 6.29 shows that the speech interaction technique resulted in more actions 
being performed in order to complete the task. This observation holds for all sessions, although it is 
encouraging to see that the number of actions generally decreased over the sessions and stabilised within the 
same range from session 5 onwards. 
Table 6.29: Descriptive statistics for the number of actions to position the cursor and paste text 
                  All participants      Participants completing all sessions
                  Speech    Keyboard    Speech    Keyboard
Session 2    n    21        22          9         9
             x̄    25.1      6.6         21.3      6.3
             s    23.3      4.7         10.8      3.6
Session 3    n    21        22          9         9
             x̄    12.2      5.4         9.7       4.6
             s    6.8       3.1         6.3       1.7
Session 4    n    24        21          9         9
             x̄    21.2      4.1         15.2      4.1
             s    20.6      2.6         10.8      1.5
Session 5    n    22        21          9         9
             x̄    10.9      4.6         10.4      3.4
             s    7.5       3.9         6.1       1.9
Session 6    n    22        20          9         9
             x̄    11.5      3.3         9.0       2.8
             s    6.3       1.6         6.1       0.8
Session 7    n    22        22          9         9
             x̄    11.1      5.2         7.8       5.9
             s    6.3       4.2         4.5       4.3
Session 8    n    20        20          9         9
             x̄    9.4       4.0         7.6       3.7
             s    7.7       2.9         4.2       1.9
Session 9    n    22        21          9         9
             x̄    10.2      4.6         7.1       5.3
             s    6.6       2.3         4.2       3.0
Session 10   n    24        23          9         9
             x̄    9.2       4.0         7.9       3.7
             s    6.3       1.8         4.6       1.7
 
Chart 6.19 plots the means for the number of actions over all sessions. 
The following hypotheses were formulated to analyse the actions for this task: 
1. H0,1: The interaction technique does not significantly affect the number of actions required to 
complete the task. 
2. H0,2: The session has no effect on the number of actions performed to complete the task. 
The repeated-measures within-subjects ANOVA showed that there was significant interaction (F(8, 128) = 
4.256, p < 0.05) between the two factors at an α-level of 0.05. Therefore, in order to compensate for the effect 
of the interaction, separate ANOVAs had to be performed for one factor whilst the other factor is controlled 
for. 
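
A separate analysis of this kind, with the interaction technique held fixed and the session as the only factor, amounts to a one-way repeated-measures ANOVA. The sketch below shows the standard sums-of-squares computation in plain Python; the data are hypothetical and the function name is an illustration, not the software used in the study. With nine participants and nine sessions the degrees of freedom are (8, 64), which is consistent with the F(8, 64) values reported for this task.

```python
def rm_anova_one_way(data):
    """One-way repeated-measures ANOVA.

    data[i][j] is the measurement for participant i in session j.
    Returns (F, df_session, df_error).
    """
    n = len(data)      # number of participants
    k = len(data[0])   # number of sessions
    grand_mean = sum(sum(row) for row in data) / (n * k)

    session_means = [sum(row[j] for row in data) / n for j in range(k)]
    participant_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand_mean) ** 2 for row in data for x in row)
    ss_session = n * sum((m - grand_mean) ** 2 for m in session_means)
    ss_participant = k * sum((m - grand_mean) ** 2 for m in participant_means)
    # Variation left after removing session and participant effects.
    ss_error = ss_total - ss_session - ss_participant

    df_session = k - 1
    df_error = (k - 1) * (n - 1)
    f_value = (ss_session / df_session) / (ss_error / df_error)
    return f_value, df_session, df_error


# Hypothetical action counts: 3 participants (rows) over 3 sessions (columns).
speech_actions = [
    [4, 2, 1],
    [4, 3, 2],
    [7, 4, 3],
]
f_value, df1, df2 = rm_anova_one_way(speech_actions)
print(f"F({df1}, {df2}) = {f_value:.3f}")
```
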
 
 
Chart 6.19: Mean number of actions to position the cursor and paste text 
When investigating the first null hypothesis, it was found that there was a significant difference between the 
interaction techniques over all sessions. In particular, this means that the speech commands required 
significantly more actions than the keyboard. Even though the number of actions decreased over the sessions, 
which indicates learning, the learning does not allow the speech to perform on a comparable level to the 
keyboard. The higher number of actions for the speech interaction technique could be explained by the types 
of commands that were issued. Therefore, an analysis was conducted to determine which commands were 
issued during the completion of this task. This showed a high incidence of the command “Right” which could 
be used to move the cursor to the right. This indicated that the participants resorted to moving the cursor to 
the correct position one character at a time. Evidently, very few participants realised that they could use the 
command “Select word” to move the cursor to the right one word at a time. This would offer a far quicker way 
to move the cursor and would contribute only a single action to the task completion. Since the keyboard and 
mouse offer the alternative of simply clicking the mouse pointer at the correct position this could account for 
the significant difference between the two interaction techniques. This finding could mean that the 
participants do not seek the most efficient method of task completion. Moreover, it could mean that they do 
not explore the use of commands that may yield the same final result but not the same intermediate results. In 
other words, “Select word” followed by the command “Right” will allow a user to move to the right, one word 
at a time yielding the desired final result. However, since it selects the word it does not have the same 
intermediate result of moving the cursor that the “Right” command does and it may have escaped the notice 
of participants as a possible command to complete the task. Familiarity with general cursor movement seems 
the obvious choice for the participants within the scope of the task and they failed to explore more efficient 
commands. This observation may also hold for the previous tasks which could also be completed with fewer 
actions but a more obscure string of actions, but where participants still navigated using a simpler but longer 
string of commands. The shorter route for completing the task was communicated to the participants if they 
were unable to discern this themselves. This could account for the lower number of mean actions as the 
exposure increased. 
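
The difference between the two navigation strategies can be illustrated by counting the commands each would need. This is a rough sketch under stated assumptions: the cursor is taken to start at the beginning of the line, each “Right” command moves one character, and moving past one word costs two commands (“Select word” followed by “Right”). The example line and function names are hypothetical.

```python
import re


def right_commands_needed(line: str, words_to_pass: int) -> int:
    """Count "Right" commands to move from column 0 to the end of the n-th word."""
    words = list(re.finditer(r"\S+", line))
    return words[words_to_pass - 1].end()


def word_commands_needed(words_to_pass: int, commands_per_word: int = 2) -> int:
    """Count commands when moving a whole word at a time (assumed 2 per word)."""
    return words_to_pass * commands_per_word


line = "The quick brown fox"
print(right_commands_needed(line, 2))  # character-by-character strategy
print(word_commands_needed(2))         # word-by-word strategy
```

The gap widens with longer words, which is in line with the high incidence of the “Right” command observed for this task.
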
The ANOVA performed to evaluate H0,2 for the speech commands showed that there was a significant 
difference between the sessions (F(8, 64) = 5.820, p < 0.05). Post-hoc tests indicated that there was significant 
improvement between session 2 and the remainder of the sessions. Similarly, H0,2 could be rejected for the 
keyboard and mouse (F(8, 64) = 2.287, p < 0.05). Tukey’s HSD post-hoc test did not highlight any significant 
difference between any sessions; therefore the less conservative Fisher’s LSD post-hoc test was used. This 
showed that session 2 differed significantly from sessions 5, 6, 8 and 10. Session 9 also differed significantly 
from session 6. Therefore, there was a noticeable effect of learning between session 2 and the remainder of 
the sessions and since session 2 was the first time the tasks were completed, the higher number of actions 
could easily be attributed to the first-time use of the application. The shorter means of positioning the 
cursor mentioned previously was communicated to participants and, while there were fewer incidents of moving the 
cursor a character at a time, this still occurred during all sessions. Nevertheless, the number of actions 
decreased with significant effect between the first and latter sessions and this does indicate an increased 
familiarity with the application such that a significant learning effect is observed. 
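
The reason Fisher’s LSD can flag differences that Tukey’s HSD does not is visible in the textbook forms of the two critical differences. The expressions below are the standard equal-sample-size versions and are given as an illustration only; the exact error term used by the analysis software in this study is not stated in the text. Here k is the number of means compared (nine sessions), ν the error degrees of freedom, n the number of observations per mean, t the Student t quantile and q the studentised range quantile.

```latex
\mathrm{LSD} = t_{\alpha/2,\,\nu}\,\sqrt{\frac{2\,MS_{\text{error}}}{n}},
\qquad
\mathrm{HSD} = q_{\alpha,\,k,\,\nu}\,\sqrt{\frac{MS_{\text{error}}}{n}}
```

For k = 2 the two thresholds coincide, since q = √2·t in that case. For k > 2 the studentised range quantile grows with k, so the HSD threshold exceeds the LSD threshold and a given pair of session means must differ by more before Tukey’s test declares it significant.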
 
6.7.8.3 Correctness of task completion 
The task correctness was evaluated according to the following criteria: 
1. An insertion took place, regardless of its position. 
2. The insertion was after the second word. 
The chart below is a stacked graph showing the results of both interaction techniques for all sessions. 
[Stacked bar chart: number of participants (vertical axis) in each score category (0, 1 or 2), shown for the 
speech and keyboard interaction techniques in sessions 2 to 10.] 
 
Chart 6.20: Correctness of task completion - Position and paste 
 
This task has the poorest results in terms of correctness of task completion. There is an approximately even 
split between participants scoring 1 and those scoring 2. The main reason for this was the fact that the task 
instruction was to paste the word after the second word in the current line. Most participants pasted the text 
as the second word. Therefore, the wording of the task was perhaps not sufficiently clear, which resulted in a lower 
correctness for the task. Consequently, it is not the interaction technique which caused the lower correctness 
and it can be concluded that the interaction technique does not affect the correctness of task completion 
when positioning and pasting. 
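
The counts that a stacked chart of this kind displays can be tallied directly from per-participant scores. The following sketch uses hypothetical records of the form (session, technique, score); the record layout and function name are assumptions made for illustration.

```python
from collections import Counter


def tally_scores(records):
    """Count participants per score for each (session, technique) pair."""
    counts = {}
    for session, technique, score in records:
        counts.setdefault((session, technique), Counter())[score] += 1
    return counts


# Hypothetical scores for the position and paste task (0, 1 or 2).
records = [
    (2, "Speech", 2),
    (2, "Speech", 1),
    (2, "Speech", 1),
    (2, "Keyboard", 2),
    (2, "Keyboard", 1),
]
for key, counter in tally_scores(records).items():
    print(key, dict(counter))
```
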
 
6.7.9 Select all and format 
This task required that all the text in the document be selected and italicised through the use of speech 
commands. There was no counterpart using the keyboard and mouse. Therefore, the analysis had to 
determine whether there was an improvement or decline in the time taken to complete the task over the 
sessions.  
 
6.7.9.1 Time to complete task 
The number of participants who completed the task, the mean time required to complete the task and the 
standard deviation for the data are summarised in Table 6.30. 
Table 6.30: Descriptive statistics for the completion time to select and format all text 
                  All participants    Participants completing all sessions
                  Speech              Speech
Session 2    n    25                  13
             x̄    30.7                27.1
             s    20.9                17.6
Session 3    n    23                  13
             x̄    18.0                16.0
             s    9.6                 9.3
Session 4    n    24                  13
             x̄    15.1                12.0
             s    7.3                 5.2
Session 5    n    23                  13
             x̄    12.9                11.5
             s    5.1                 3.6
Session 6    n    24                  13
             x̄    13.2                13.2
             s    8.3                 9.6
Session 7    n    22                  13
             x̄    9.9                 9.9
             s    3.2                 3.9
Session 8    n    20                  13
             x̄    12.9                11.8
             s    5.2                 5.4
Session 9    n    22                  13
             x̄    10.5                10.4
             s    5.4                 5.7
Session 10   n    24                  13
             x̄    9.8                 10.4
             s    4.8                 6.4
 
Chart 6.21 provides a visual representation of the spread of data as it is a plot of the means for all sessions. 
 
Chart 6.21: Means for the completion time to select and format all text 
 
The table and chart show that there is a steady decrease in the completion time of the task for the first few 
sessions, thereafter stabilising to a relatively steady pace. The following hypothesis was evaluated: 
H0: The session has no effect on the task completion time. 
The assumption of sphericity (χ²(35) = 24.843, p > 0.05) was met and H0 (F(8, 96) = 5.351, p < 0.05) could be 
rejected at an α-level of 0.05. Therefore, there was a significant decrease in completion time. In particular, the 
completion time of session 2 was significantly higher than the times for sessions 4 to 10. This provides evidence 
that there was significant improvement in task completion time as exposure to the application increased. 
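
The degrees of freedom reported for these tests can be checked against the design. The following is a minimal sketch, assuming that the sphericity statistic is Mauchly's test (whose chi-square approximation has k(k − 1)/2 − 1 degrees of freedom for k repeated measures) and that the analysis is a one-way repeated-measures ANOVA over the nine sessions (2 to 10) with the 13 participants who completed all sessions. The function names are illustrative only and are not part of the tooling used in the study.

```python
def sphericity_df(levels: int) -> int:
    """Degrees of freedom of the chi-square approximation to Mauchly's test."""
    return levels * (levels - 1) // 2 - 1


def rm_anova_df(participants: int, levels: int) -> tuple[int, int]:
    """(effect df, error df) for a one-way repeated-measures ANOVA."""
    df_effect = levels - 1
    df_error = (participants - 1) * (levels - 1)
    return df_effect, df_error


sessions = 9  # sessions 2 to 10
print(sphericity_df(sessions))    # 35
print(rm_anova_df(13, sessions))  # (8, 96)
```

Both values agree with the reported χ²(35) and F(8, 96).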
 
6.7.9.2 Number of actions 
The difference in the number of actions was also analysed in order to determine whether there were any 
differences between the sessions. The table below summarises the descriptive statistics for the number of 
actions performed. The chart that follows the table is a plot of the mean number of actions for all sessions. 
Table 6.31: Descriptive statistics for the number of actions to select and format all text
                        All participants    Participants completing all sessions
                        Speech              Speech
Session 2     n         25                  14
              x̄         10.7                8.9
              s         8.7                 6.3
Session 3     n         23                  14
              x̄         6.3                 6.0
              s         3.9                 4.6
Session 4     n         24                  14
              x̄         5.3                 4.6
              s         3.2                 3.1
Session 5     n         23                  14
              x̄         4.7                 4.2
              s         2.8                 2.7
Session 6     n         24                  14
              x̄         4.8                 4.7
              s         3.3                 3.5
Session 7     n         22                  14
              x̄         3.8                 3.9
              s         1.7                 1.8
Session 8     n         20                  14
              x̄         5.6                 5.1
              s         2.7                 2.8
Session 9     n         22                  14
              x̄         4.0                 4.3
              s         2.5                 3.0
Session 10    n         24                  14
              x̄         4.5                 4.9
              s         2.8                 3.4
 
Chart 6.22: Mean number of actions to select and format all text 
 
As with the time, there is a sharp decline in the number of actions performed in the first few sessions, after 
which the mean stabilises within a certain range. The following hypothesis was formulated: 
H0: The session has no effect on the number of actions required to complete the task. 
The assumption of sphericity required for a repeated-measures ANOVA was violated (χ²(35) = 54.057, p < 
0.05). H0 could be rejected at an α-level of 0.05 (F(8, 104) = 2.718, p < 0.05), meaning that the session does 
have an impact on the number of actions required to complete the task. The adjusted corrections were as 
follows: F(4.1, 53.1) = 2.718, p < 0.05 and F(6.2, 80.5) = 2.718, p < 0.05.  
Tukey’s HSD post-hoc test indicated that session 2 differed significantly from sessions 4 to 7 and 9. Session 2 
required significantly more actions to complete the task than was the case in these sessions. Overall, this 
shows that only the first session in which the task was completed had a significantly higher completion time 
and number of actions. This is sufficient evidence to show that there is ample improvement in the efficiency of 
task completion.  
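
When sphericity is violated, the adjusted corrections scale both degrees of freedom of the uncorrected F-test by a sphericity estimate ε and leave the F statistic itself unchanged. As a hedged illustration, the sketch below backs out the ε implied by the degrees of freedom reported above. The ε values themselves are not given in the text, so they are recovered only approximately because the published degrees of freedom are rounded to one decimal. The function names are hypothetical.

```python
def implied_epsilon(corrected_df: float, uncorrected_df: float) -> float:
    """Sphericity estimate implied by a corrected degree of freedom."""
    return corrected_df / uncorrected_df


def corrected_dfs(epsilon: float, df_effect: float, df_error: float) -> tuple[float, float]:
    """Scale both degrees of freedom by epsilon; F is left unchanged."""
    return epsilon * df_effect, epsilon * df_error


# Uncorrected test reported above: F(8, 104)
for label, (df1, df2) in {"first correction": (4.1, 53.1),
                          "second correction": (6.2, 80.5)}.items():
    eps_effect = implied_epsilon(df1, 8)
    eps_error = implied_epsilon(df2, 104)
    # The two estimates should agree up to the rounding of the reported values.
    print(f"{label}: epsilon ~ {eps_effect:.3f} (effect df), {eps_error:.3f} (error df)")
```

For each correction the two estimates agree closely, which is what one would expect if a single ε was applied to both degrees of freedom.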
 
6.7.9.3 Correctness of task completion  
This task had the following criteria against which it was evaluated: 
1. A portion of text had to be selected. 
2. The selection should include all text in the document. 
3. Italic formatting must be applied. 
All participants, apart from one individual in session 3 who scored 2, managed to complete this task correctly 
for all sessions. Therefore, the interaction technique allows for the task to be completed correctly from the 
first session. 
 
6.8 Summary of results 
This chapter contains a substantial number of analyses and results, and this section will be used to summarise 
these findings in a comprehensive but much more succinct way. Table 6.32 contains a summary of the results 
for each of the tasks. Where there was a significant difference detected between the interaction techniques an 
S will be used if the speech interaction technique had a lower mean completion measurement and a K will be 
used if the keyboard and mouse had a lower mean completion measurement. A blank cell indicates that there 
was no difference between the interaction techniques. The same technique will be used for the session, where 
a tick indicates there was improvement over the sessions for both interaction techniques.  Note that while 
there are significant differences indicated for the completion times between sessions, this does not imply that 
all sessions differed significantly but rather that there were some sessions which facilitated faster completion 
times as exposure increased. 
 Table 6.32: Summary of significant results
                                 H0,1: Interaction technique            H0,2: Session
                                 Completion time   Number of actions    Completion time   Number of actions
Line selection and formatting                                                             S
Select all and remove            S                 S
Select words and format          K
Paste                            S                 S
Undo
Select word and copy
Position and paste               K                 K
 
The speech interaction technique performed relatively well when compared with the keyboard and mouse, in 
some instances even surpassing the performance of the traditional input methods. Clearing of all text in the 
document and pasting were even faster and completed with fewer actions than when using the keyboard and 
mouse. It is only when positioning within the document must occur that the keyboard outperforms the speech 
interaction technique in terms of both measurements. 
While this finding was very encouraging, the most promising finding was that there was continued 
improvement in the efficiency with which the task was completed. Even though the improvement between 
subsequent sessions was not always significant, the fact that there was continual improvement hints at the 
possibility that the two interaction techniques could eventually compete on a comparable level for all tasks or 
that the speech interaction technique could eventually perform better. At the very least the final sessions 
usually showed significantly better performance than the first sessions. The fact that improvement is shown 
over the sessions is testimony to the fact that the speech commands are easy to learn and remember. Few 
participants had to refer back to the command list provided after the first session and could manage to 
complete the tasks with speeds comparable to those achieved with the keyboard and mouse. Therefore, the 
use of a menu-orientated grammar allowed the speedy adoption thereof and did not appear to place 
additional strain on the users. Although the menu-orientated grammar was not compared with a task-
orientated grammar, the fact that the grammar was so quickly learnt is motivation enough to recommend the 
use of a menu-orientated grammar for a word processor. Even though users tend to resort to task-orientated 
commands when faced with a complex task (Berg et al., 2010), the assumption that the terminology of the 
word processor is unique and part of the initial learning process was supported by the current study. It is 
therefore surmised that complex tasks will also be facilitated in this way since even under those 
circumstances there is a unique grammar already in place which can quickly be learnt by the users. 
Since the traditional means often offer multiple ways of completing a task, the most effective method was not 
always chosen. The same was observed when speech was used to move the cursor. Rather, the user chooses the 
method whose intermediate actions appear closer to the final result, even though in reality a shorter method 
is available. The immediate responses of the shorter method may, however, appear contradictory to the final 
solution or 
simply not appear to move the user closer to the desired goal. For example, in order to position the cursor, the 
user may prefer to move the cursor one character at a time in the desired direction, several times in 
succession, since the intermediate results move the task closer to completion. In contrast, selecting a word 
does not immediately appear to move the cursor, when in effect it does, and at a much faster rate than moving 
one character at a time. 
Successive tasks that make use of similar speech commands also appear to show that the initial task jogs the 
memory of the participant for the current session to such an extent that the subsequent tasks can be 
completed more efficiently. This observation was, however, not analysed statistically and is only a suspicion 
based on the available data and observations of the participants during test completion. 
The fact that the speech commands resulted in fewer actions may be attributed to the fact that the grammar 
that was used was fairly simple and provided commands to complete basic operations only. The complexity of 
the options provided by Word is much higher than that accommodated in the grammar. This may have led to 
expedited speeds and actions to complete tasks as there was, in many instances, only a single command 
available to complete a task. In contrast, when using Word in the normal capacity there are, more often than 
not, at least three different ways to complete a task, which may place an added burden on the user of the 
application. However, the goal of the study was not to provide a complete alternative to the keyboard and 
mouse but rather to determine whether common word processing tasks could be achieved using an 
alternative interaction technique. Therefore, by the very nature of the study, the grammar was required to be 
simple in composition. However, judging by the results tabulated above, this did not have an impact on the 
results as, in most instances, the interaction techniques could perform at a level that was comparable. The 
tasks where the speech outperformed the keyboard perhaps had the most intuitive speech commands: this 
could explain why the speech task could be completed quicker and with fewer actions than its keyboard 
counterpart. The remainder of the grammar was less intuitive and may have required some time to memorise 
and learn which could account for the fact that the speech was not, in these instances, faster than the 
keyboard. However, in these instances it was at least comparable to the keyboard. 
In terms of the correctness of the tasks, there was very little difference between the interaction techniques. 
Furthermore, the majority of the tasks could be completed correctly from the very first session and using 
either interaction technique. The instances where there were tasks which were not completed correctly were 
largely due to the participants not reading the instructions properly or the fact that the wording of the 
instructions may not have been very clear. Owing to the large number of zeros in the categories, meaningful 
statistical analysis could not be conducted. Even so, it could be inferred that the interaction technique did not 
affect the level of correctness with which the task could be completed. Furthermore, since high correctness 
measurements were achieved from the very first session there was also no learning required in order to 
complete the task correctly.  
Overall, this study disproved the notion that speech commands cannot be used effectively and efficiently in an 
editing environment (Klarlund, 2003). The efficiency was for the most part similar or superior to the keyboard 
and mouse. Effectiveness for the two interaction techniques was on an equivalent level. Therefore, there is 
confirmation of the findings by Karl et al. (1993) that speech is able to provide a more efficient and effective 
means of completing word processing tasks. The study of Karl et al. (1993) allowed the modalities of speech 
and mouse/keyboard to be mixed and also did not provide for text selection using speech commands. The 
current study therefore also improved on these prior findings. 
 
6.9 Further research 
The tasks that were chosen for this part of the study were chosen as some of the more common tasks that 
may occur in the word processing application. Therefore, they may be viewed as some of the less complex 
tasks and other tasks may require less intuitive and more complex commands. However, this will 
mirror the nature of any other system which provides access to common tasks “at your fingertips”, for 
example the Home tab in Office, while less frequently used or more complex tasks require further navigation 
and perhaps place a heavier burden on one’s memory. It may be possible to extend the grammar to encompass 
many more tasks within the word processor application. Another consideration would be to use a smaller 
default grammar and to activate the extended grammar only when it is required by the user. 
The results of the study indicate that using speech could dramatically increase the efficiency of end-users. 
However, it remains to be seen if this result holds when the user is free to use the grammar in a normal 
setting. This would require that the participants would not be given small separate tasks but rather that they 
would have to compile a document from scratch with pre-defined formatting. The participant would then be 
able to issue verbal commands during standard interaction with the application to apply formatting, 
corrections and move around the document. Usability measures for such a task can then be compared to 
measurements recorded when speech commands are not available but the participant has to complete the 
same task. This would give a clearer indication as to whether the incorporation of speech commands in a word 
processing application is a viable alternative to the mouse and keyboard. 
Whether or not an extended grammar is considered, further research will have to be done where the exposure 
to the application is extended in order to determine whether the learning effect can continue to an even 
greater degree. This could mean that the speech could perform on a similar level to the mouse and keyboard 
on a number of other tasks – or eventually even better. Such a study could use a smaller sample as it has 
already been established that it is possible to use this interaction technique effectively. 
 
6.10 Summary 
This chapter reported on the comparison of similar tasks completed using either the mouse and keyboard 
or speech commands. The measurements which were analysed were 
time to complete the task and the number of actions that were performed during completion of the task. The 
correctness with which the tasks could be completed was also measured and analysed. For the majority of the 
tasks it was found that the interaction techniques could compete on a comparable level, particularly as the 
time the participant used the application increased. Therefore, there was a definite improvement in user 
performance as the use of the application was extended. This indicates that the application was indeed 
learnable. Since the speech interaction techniques could also be used with the same efficiency as the keyboard 
and mouse, the proposed use of speech commands within a word processor application is viable. The 
correctness with which the task could be completed was affected neither by the interaction technique nor by 
the amount of exposure to the system. 
In conclusion, although the interaction technique affects the time with which some tasks can be completed 
and sometimes the number of actions required it does not affect the correctness with which the task is 
completed. Additionally, the time to complete a task generally improved as exposure to the application 
increased as did the number of actions required. However, the correctness of the task completion was high 
from the very start so although it took longer and required more effort in the first sessions, the tasks were still 
completed correctly. 
The following chapter will report on the analysis of typing on the onscreen keyboard using eye gaze and 
speech recognition as an interaction technique. 
 
 
 
CHAPTER 7 
ANALYSIS OF TYPING TASKS 
 
7.1 Introduction 
The previous chapter concentrated on the analysis of the use of speech commands for formatting, navigating a 
document and other common word processor tasks. During the longitudinal testing, participants were also 
required to enter text using both the keyboard and eye gaze and speech (analogous to look-and-shoot). The 
buttons on the onscreen keyboard used varying sizes and spacing for the typing tasks. This chapter will analyse 
and discuss the effectiveness and efficiency of eye gaze and speech when used for text input, as compared to a 
traditional keyboard. 
 
7.2 Participants 
Since these tasks were part of the task list set out for the longitudinal testing of the multimodal Word 
interface, the participants for this analysis were the same as in the previous chapter. Therefore, there were 25 
participants who completed the typing tasks. There were, however, three participants who were unable to 
type using eye gaze and speech for various reasons. The first participant was unable to maintain a stable eye 
gaze on any of the buttons on the onscreen keyboard. This behaviour was observed for all ten sessions and 
since there was no improvement and the participant was unable to type even a single character, this 
participant will not be included in this chapter’s analysis. The second participant experienced the same 
problem as the first. This participant wore glasses with thick lenses and a very wide frame. Although an 
acceptable calibration was achieved by the participant, he was unable to select any of the onscreen buttons. It 
is quite possible that his glasses interfered with his ability to type. Therefore, his data was also excluded from 
the analysis in this chapter.  
The third participant could manage to maintain a stable eye gaze on the onscreen buttons but the speech 
engine was unable to recognise the commands he issued to select the button. The participant had an unusual 
pronunciation of some words and also did not enunciate very clearly. Several measures were taken in an 
attempt to correct this. Firstly, this participant completed additional training sessions to improve the accuracy 
of his speech profile. When this was unsuccessful, special commands were added specifically for this 
participant but while this was initially successful, the participant quickly slipped back into his normal speaking 
tone and his enunciation of the special commands was degraded to such a degree that they no longer worked. 
Therefore, it was considered prudent rather to discard the data of this participant. 
Consequently, the sample size for the typing tasks was twenty-two, comprising 14 males and 8 females. 
There were 6 English-speaking participants, 6 Afrikaans-speaking and the remainder (10) had an African 
language as their first language. The average age of participants was 21.1 (standard deviation = 2.0) and there 
were 9 Computer Science students and 13 non-Computer Science students. 
 
7.3 Tasks 
In total there were two typing tasks using the keyboard and three using the eye gaze and speech. When using 
eye gaze and speech the size of the buttons was set to 60×60 pixels (≈1.55° visual angle). Buttons were spaced 
60 pixels apart with a gravitational well of 20 pixels (≈0.52° visual angle) on all sides of each button. Since the 
results of Chapter 6 showed that the gravitational well was the most effective means of increasing the usability 
of eye gaze and speech as a pointing device, a gravitational well was included in the onscreen keyboard. The 
larger the gravitational well is, the more widely spaced the buttons must be. Consequently, no screen real 
estate is gained through the use of a gravitational well and in order to optimise the aesthetic appeal of the 
onscreen keyboard it was decided rather to decrease the gravitational well so that the buttons could be closer 
together and then to enlarge the buttons to make selection easier.  
Although there were three typing tasks using these settings, only the last two of each session were included in 
the analysis. This was due to the fact that the first one was viewed more as a practice typing task to 
reacclimatise the participants to typing using eye gaze and speech. The participants were not told that the first 
task would not count towards the analysis and were instructed to complete all tasks to the best of their ability. 
Therefore, the analysis included two typing tasks to be completed with a keyboard and two with eye gaze and 
speech. 
Additional typing tasks were added from the fifth session onwards in order to test varying sizes and spacing 
between buttons. These additional tasks were added to the end of the existing task list. By then the majority of 
the participants were completing the current task list in less than 30 minutes. No pressure was placed on the 
participants to complete all tasks within their scheduled time so it was felt that adding additional tasks to the 
end of the test would not unduly cause any more anxiety or place more strain on the participants. Within 
these additional typing tasks, the first one had to be completed using the originally sized and spaced buttons. 
The next two had to be completed with buttons that were 50×50 pixels (≈1.29° visual angle) in size and spaced 
70 pixels (≈1.80° visual angle) apart. Following this there were another two tasks which had to be completed 
using buttons that were 50×50 pixels in size but were spaced 60 pixels apart. For all typing tasks a gravitational 
well of 20 pixels on all sides of the buttons was employed. 
The use of the keyboard will be denoted by K and the larger originally sized buttons by Speech-L. Speech-L was 
used for the first two typing tasks using eye gaze and speech for text input. From session 5 onwards there 
were two typing tasks using smaller buttons which were widely spaced, which will be denoted by Speech-SW, 
and then a final two using smaller buttons which were spaced closer together, namely Speech-SC. 
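
The three onscreen keyboard configurations can be summarised in a short sketch. It assumes that "spaced apart" refers to the edge-to-edge gap between neighbouring buttons and that the gravitational well simply extends the selectable area of a button by 20 pixels on every side. Neither assumption is stated explicitly here, and the class and method names are hypothetical. Only one axis is modelled, with the first button starting at x = 0.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeyLayout:
    name: str
    button_px: int      # button width (and height) in pixels
    spacing_px: int     # assumed edge-to-edge gap between buttons
    well_px: int = 20   # gravitational well on every side

    def neutral_gap_px(self) -> int:
        """Pixels between two buttons that belong to neither well."""
        return self.spacing_px - 2 * self.well_px

    def hit_column(self, gaze_x: float) -> Optional[int]:
        """Index of the button whose area plus well contains gaze_x, if any."""
        pitch = self.button_px + self.spacing_px
        shifted = gaze_x + self.well_px          # start of button 0's well -> 0
        column = math.floor(shifted / pitch)
        offset = shifted - column * pitch
        if column < 0 or offset > self.button_px + 2 * self.well_px:
            return None                          # gaze is in a neutral gap
        return column


LAYOUTS = [
    KeyLayout("Speech-L", button_px=60, spacing_px=60),
    KeyLayout("Speech-SW", button_px=50, spacing_px=70),
    KeyLayout("Speech-SC", button_px=50, spacing_px=60),
]

for layout in LAYOUTS:
    print(layout.name, layout.neutral_gap_px())
```

Under these assumptions the wells of neighbouring buttons never overlap in any of the three configurations, which is consistent with the remark that a larger well requires more widely spaced buttons.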
All text that had to be typed was selected randomly for each task from the set of 35 pre-selected phrases 
(Section 3.4.3.3). Similar to the previous tasks, all the typing tasks were displayed to the participant using a 
window overlaid over the word processor application. 
 
7.4 Measurements 
The measurements that were selected for analysis were the character error rate and the characters typed per 
second. Since both input methods, namely typing with the traditional and the onscreen keyboard, were 
character based, the character error rate and characters per second are a more applicable means of measuring 
the effectiveness and efficiency of the interaction technique (Read, 2005). The character error rate (CER) 
measures how many insertions, deletions and substitutions have taken place between the presented text and 
the transcribed text (Read, 2005). This measurement, which is effectively the minimum number of insertions, 
substitutions and deletions, is synonymous with the Levenshtein distance between two strings. As discussed in 
section 3.4.3.2, the Levenshtein distance (Levenshtein, 1965) measures the difference between two strings in 
terms of the minimum number of insertions, substitutions and deletions required to transform one string (in 
this case the presented text) into another (in this case the transcribed text).  This sum is then divided by the 
number of characters to give a character error rate (Read, 2005). Since there are multiple ways in which the 
presented text can be transformed into the transcribed text, using the same minimum number of edits, a more 
accurate means of calculating this character error rate is to determine the number of ways in which the 
transformation can occur (MacKenzie & Soukoreff, 2003). These possible transformations are called the 
optimal alignments. Once these optimal alignments have been identified, their mean length is calculated and 
then the Levenshtein distance is divided by this mean length to give an error rate (MacKenzie & Soukoreff, 
2002). For example (the example is taken from MacKenzie & Soukoreff, 2002), suppose the presented text is 
the word “quickly” and the test participant types “qucehkly”. The Levenshtein distance between these two 
strings is 3, but there are four different ways in which “quickly” can be transformed into “qucehkly” by making 
only 3 edits. These four different ways are referred to as the optimal alignments. The Levenshtein distance is 
then divided by the mean length of these optimal alignments to give a more accurate error rate. This 
error rate measurement will be analysed in this chapter as a measure of effectiveness.  
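
The calculation described above can be sketched as follows. This is an illustrative implementation of the general idea (a dynamic-programming distance matrix followed by an enumeration of all minimum-cost alignments), not the code used in the study, and the function names are hypothetical.

```python
def msd_matrix(presented: str, transcribed: str) -> list[list[int]]:
    """Dynamic-programming matrix of Levenshtein distances between prefixes."""
    rows, cols = len(presented) + 1, len(transcribed) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # i deletions
    for j in range(cols):
        d[0][j] = j                      # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if presented[i - 1] == transcribed[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
    return d


def levenshtein(presented: str, transcribed: str) -> int:
    return msd_matrix(presented, transcribed)[-1][-1]


def optimal_alignment_lengths(presented: str, transcribed: str) -> list[int]:
    """Length (number of aligned positions) of every minimum-cost alignment."""
    d = msd_matrix(presented, transcribed)
    lengths: list[int] = []

    def walk(i: int, j: int, length: int) -> None:
        if i == 0 and j == 0:
            lengths.append(length)
            return
        if i > 0 and j > 0:
            cost = 0 if presented[i - 1] == transcribed[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                walk(i - 1, j - 1, length + 1)
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            walk(i - 1, j, length + 1)
        if j > 0 and d[i][j] == d[i][j - 1] + 1:
            walk(i, j - 1, length + 1)

    walk(len(presented), len(transcribed), 0)
    return lengths


def error_rate(presented: str, transcribed: str) -> float:
    """Levenshtein distance divided by the mean optimal-alignment length."""
    distance = levenshtein(presented, transcribed)
    if distance == 0:
        return 0.0
    lengths = optimal_alignment_lengths(presented, transcribed)
    return distance / (sum(lengths) / len(lengths))


print(levenshtein("quickly", "qucehkly"))                     # 3
print(sorted(optimal_alignment_lengths("quickly", "qucehkly")))  # [8, 8, 8, 9]
print(round(100 * error_rate("quickly", "qucehkly"), 1))      # 36.4
```

For the example pair, the sketch finds four optimal alignments with a mean length of 8.25, giving an error rate of 3 / 8.25, or roughly 36.4%.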
In order to measure the efficiency of the interaction techniques, the characters per second (CPS) measurement 
will be used. This measurement literally measures the number of characters that were typed and then divides 
it by the time taken to type the characters, measured in seconds. Similar to previous studies (MacKenzie, 
2002), the time taken was measured from the time when the first character was typed to the time the last 
character was typed. This excludes the time required to read the question, including the sentence that must be 
typed, which is indistinguishable from the time taken to locate the first character that must be typed – which is 
then also excluded. As a consequence of measuring the time in this manner, the number of characters 
becomes n-1. 
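
As a brief illustration of this measurement, the sketch below computes characters per second from a list of keystroke timestamps. The timestamp list and the function name are assumptions made for the example; how the timings were logged in the study is not described here.

```python
def characters_per_second(keystroke_times: list[float]) -> float:
    """CPS timed from the first to the last typed character.

    Timing starts at the first character, so only the remaining n - 1
    characters are produced within the measured interval.
    """
    if len(keystroke_times) < 2:
        raise ValueError("at least two typed characters are required")
    elapsed = keystroke_times[-1] - keystroke_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return (len(keystroke_times) - 1) / elapsed


# Five characters typed half a second apart: 4 characters in 2 seconds.
print(characters_per_second([10.0, 10.5, 11.0, 11.5, 12.0]))  # 2.0
```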
Corrections to transcribed text were not captured as it was felt that the correction would either be another 
key press or speech command that would be issued. Since these were analysed separately in the previous 
chapter it was decided not to include these measurements in the typing tasks as well. 
 
7.5 Analysis 
Analysis will first only be conducted between the tasks which used K and Speech-L. All the chosen 
measurements will be analysed for these tasks. Following this, all measurements will be analysed for the two 
keyboard typing tasks (K), the original two eye gaze and speech tasks (Speech-L), the two typing tasks for the smaller buttons 
widely spaced (Speech-SW) and the two using the smaller buttons closer together (Speech-SC). Since speech-
SW and speech-SC were only included from the fifth session onwards and it was not judicious simply to discard 
the first few sessions, it was decided to rather conduct two separate analyses.  
 
7.5.1 Analysis of keyboard and large buttons 
The error rate measurement was calculated individually for all typing tasks and then averaged over the typing 
tasks for each interaction technique and for each participant. As previously mentioned, the first typing task 
using speech and eye gaze was not considered for analysis as this was viewed as a practice task entry for each 
session. The participants were not informed of this and were instructed to complete all tasks to the best of 
their ability.  
 
7.5.1.1 Error rate 
The average error rates, as discussed in a prior section, for each participant and each interaction technique 
were calculated for all sessions. The descriptive statistics for the two keyboard typing tasks and the two 
speech-L typing tasks are tabulated below. For each session, the first line gives the number of participants 
who were included in the analysis, the second line the mean error rate and the third the standard deviation. 
  
  
 Table 7.1: Descriptive statistics for keyboard and speech-L error rate
                        All participants        Participants completing all sessions
                        Speech     Keyboard     Speech     Keyboard
Session 2     n         17         21           8          13
              x̄         15.3       7.9          13.9       6.6
              s         7.7        6.7          4.6        5.6
Session 3     n         20         20           8          13
              x̄         17.7       5.3          18.8       6.4
              s         8.2        5.8          10.0       6.9
Session 4     n         20         21           8          13
              x̄         15.0       4.0          10.4       4.1
              s         11.2       4.3          6.9        4.7
Session 5     n         21         21           8          13
              x̄         14.2       6.5          10.1       4.7
              s         9.0        6.7          4.2        6.4
Session 6     n         20         21           8          13
              x̄         12.8       5.0          9.5        4.0
              s         7.2        5.0          6.4        3.6
Session 7     n         18         20           8          13
              x̄         12.2       4.9          8.2        4.8
              s         9.2        7.1          7.3        8.3
Session 8     n         17         18           8          13
              x̄         12.0       5.8          12.2       4.0
              s         6.5        5.2          5.6        4.8
Session 9     n         18         19           8          13
              x̄         9.3        3.4          6.3        2.6
              s         6.5        4.6          5.7        3.5
Session 10    n         17         21           8          13
              x̄         9.3        3.9          6.1        3.8
              s         5.9        3.8          4.9        4.2
 
Chart 7.1 is a plot of the mean error rate for the interaction technique across all sessions. 
 
Chart 7.1: Mean error rate of keyboard and speech-L 
 
From Table 7.1 and Chart 7.1 it can be seen that the speech interaction, on average, caused a higher 
error rate than did the keyboard. This observation holds for all sessions, although the error rate for the speech 
interaction technique does improve steadily as the amount of exposure increases. The following hypotheses 
were formulated for this analysis: 
1. H0,1: The error rate is not affected by the interaction technique. 
2. H0,2: The error rate is not affected by the session during which the task was completed. 
The same procedure as in the previous chapter was followed for the analysis of the data (see Section 3.5 for an 
explanation of data analysis).  
In this instance, the assumption of sphericity (χ²(35) = 54.795, p < 0.05) was not met at an α-level of 0.05. 
Therefore, Table 7.2 reports the results of the repeated-measures ANOVA, the multivariate tests as well as the 
adjusted corrections. 
Table 7.2: Results of error rate analysis for keyboard and speech-L
                                   ANOVA                          Geisser-Greenhouse               Huynh-Feldt                        Multivariate
Interaction technique              F(1, 19) = 14.406, p < 0.05
Session                            F(8, 152) = 5.092, p < 0.05    F(4.8, 90.7) = 5.092, p < 0.05   F(6.9, 131.1) = 5.092, p < 0.05    F(8, 12) = 4.818, p < 0.05
Interaction technique × Session    F(8, 152) = 1.860, p > 0.05    F(4.8, 90.7) = 1.860, p > 0.05   F(6.9, 131.1) = 1.860, p > 0.05    F(8, 12) = 1.633, p > 0.05
 
The results contained in the table above show that both null hypotheses could be rejected. Therefore, the 
interaction technique had a significant effect on the error rate of the typed sentence. Since the keyboard had a 
consistently lower average than the eye gaze and speech this means that the use of eye gaze and speech 
interaction for text input results in a higher error rate. 
Post-hoc tests were required to determine which sessions differed significantly. Tukey’s honestly significant 
difference (HSD) test was used to establish the cause of the differences. It was found that session 2 differed 
significantly from session 9 and session 3 differed significantly from sessions 6, 7, 9 and 10. Since the average 
error rate for sessions 2 and 3 was higher than for the remainder of the sessions, this indicates some degree of 
learning over time. Although sequential sessions improved, the rate of improvement was not significant. 
However, the overall improvement from the first sessions to the last was significant. In particular, the first few 
sessions with speech-L differed significantly from all later sessions. The rate at which the eye gaze and speech 
interaction technique improves over time is an encouraging observation and hints that the error rate could 
possibly reach a comparable level with that of the keyboard. More typing sessions would have to be tested 
and analysed in order to verify this supposition. 
The average error rate for each session was then inspected more closely to determine how many participants 
were able to type a completely error-free sentence. The results, for all participants, are shown in the bar graph 
below. 
Chart 7.2: Error-free transcribed text for keyboard and speech-L 
 
The graph shows that each session for the keyboard had at least two error-free transcribed text strings. There 
were only three sessions in which an error-free transcribed text string was achieved through the use of the eye 
gaze and speech and in each only by a single individual (although not the same individual more than once). 
This may not bode well for the effectiveness of speech-L for text input, although it is still entirely possible that 
prolonged use may result in decreased error rates as users will become accustomed to the use of the onscreen 
keyboard. Overall, it is the fact that the effectiveness increased as time went by that is of more importance 
than the fact that completely error-free transcription could not be achieved. If the effectiveness continues to 
improve, error-free transcription should eventually be achievable. Additionally, error-correction 
mechanisms are always available and, should they be used, the final transcribed text may be 
error-free. 
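
As an illustration only, the tally behind a chart of error-free transcriptions can be sketched in Python. The function name, the data layout and the sample sentences are assumptions made for this example; they do not come from the study's software or data.

```python
def count_error_free(pairs_by_session):
    """Count, per session, how many transcriptions match the presented text exactly."""
    return {
        session: sum(1 for presented, transcribed in pairs if presented == transcribed)
        for session, pairs in pairs_by_session.items()
    }

# Hypothetical data: (presented, transcribed) pairs for three participants per session.
example = {
    "Session 2": [
        ("the quick fox", "the quick fox"),
        ("the quick fox", "the quik fox"),
        ("the quick fox", "the quick fox"),
    ],
    "Session 3": [
        ("a lazy dog", "a lazzy dog"),
        ("a lazy dog", "a lazy dg"),
        ("a lazy dog", "a lazy bog"),
    ],
}

error_free = count_error_free(example)
```

A transcription counts as error-free here only when it is character-for-character identical to the presented text, which is the strictest possible reading of the term.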
 
7.5.1.2 Breakdown of error rates 
During ISO testing it was observed that the number of incorrect target clicks was significantly higher when 
using eye gaze and speech than when using the mouse. This was attributed to a tendency by the participants 
to acquire the intended target, start issuing the command and then move the eye gaze to the next intended 
target before the command had been processed. This resulted in the next target being selected instead of the 
designated target. If this behaviour was emulated during the typing tasks, it would manifest in a higher error 
rate, which was discovered to be the case. However, clicking the wrong target would specifically result in 
either an insertion error – if the participant realised the error and then inserted the correct character after the 
incorrect character, or a substitution error – if the participant did not realise the error and return to insert the 
correct character. Therefore, in order to determine whether this was true the error rate measured for the 
interaction techniques was broken down into the number of insertions, deletions and substitutions which 
could have occurred in order to transform the presented text into the transcribed text. Each of these was 
expressed as a percentage of the total error rate percentage as illustrated by MacKenzie and Soukoreff (2002). 
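
A minimal Python sketch of such a breakdown follows. It computes the minimum string distance between the presented and the transcribed text, classifies the edits along one optimal alignment, and expresses each count as a percentage of the longer string's length. The function names are hypothetical, and both the use of a single alignment (with diagonal moves preferred) and the choice of denominator are simplifying assumptions made here; the procedure of MacKenzie and Soukoreff (2002) should be consulted for the exact method.

```python
def msd_breakdown(presented, transcribed):
    """Return (insertions, deletions, substitutions) along one optimal alignment."""
    m, n = len(presented), len(transcribed)
    # dist[i][j] = minimum edits needed to turn presented[:i] into transcribed[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if presented[i - 1] == transcribed[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    # Walk back from the end of both strings, classifying each edit.
    ins = dels = subs = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if presented[i - 1] == transcribed[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                subs += cost
                i, j = i - 1, j - 1
                continue
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins += 1   # extra character present in the transcribed text
            j -= 1
        else:
            dels += 1  # presented character missing from the transcribed text
            i -= 1
    return ins, dels, subs


def error_percentages(presented, transcribed):
    """Express each edit type as a percentage of the longer string's length."""
    ins, dels, subs = msd_breakdown(presented, transcribed)
    scale = 100.0 / (max(len(presented), len(transcribed)) or 1)
    return {
        "insertions": ins * scale,
        "deletions": dels * scale,
        "substitutions": subs * scale,
        "total": (ins + dels + subs) * scale,
    }
```

Because several alignments can share the same minimum distance, a single backtrace such as this one may attribute an error to a different category than another equally short alignment would.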
For illustration purposes, the first session’s data, as broken down into the categories of insertions, deletions 
and substitutions errors, is shown as a stacked bar graph in Chart 7.3. As can clearly be seen, the highest 
number of edits was insertions, followed by substitutions and finally deletions. This was true for both the 
speech-L interaction technique as well as the keyboard interaction technique. More specifically, the eye gaze 
and speech had a total character error rate of 15.4% which consisted of 6.6% insertions, 5.3% substitutions 
and 3.4% deletions.  
For further illustrative purposes, the same information is provided for the last session (Chart 7.3). While the 
average error rate for the speech decreased, the largest share of the errors was still caused by insertions. 
Substitutions were the second most frequent type of error, followed closely by deletions. The same pattern was observed for the 
keyboard interaction technique. 
 
[Stacked bar graph: error rate (%) split into insertions, deletions and substitutions for Speech and Keyboard, first task and last task]
Chart 7.3: Breakdown of first and last task's error rates for keyboard and speech-L 
 
Each of the types of edits was analysed separately, the results of which are discussed in the following sections. 
 
7.5.1.2.1 Insertion error percentage 
The ratio of insertions was calculated for each participant on each interaction technique and for all sessions 
(Table 7.3). These could then be analysed statistically to determine whether the afore-mentioned finding of 
incorrect clicks when using eye gaze and speech has an impact when typing. 
Chart 7.4 shows the plot of the mean insertion percentage for both interaction techniques over all 
sessions.  
 
  
Table 7.3: Descriptive statistics for insertion errors of keyboard and speech-L
                  All participants      Participants completing all sessions
                  Speech    Keyboard    Speech    Keyboard
Session 2    n    17        21          10        13
             x̄    6.6       6.5         6.5       6.3
             s    5.0       6.4         5.2       5.6
Session 3    n    20        20          10        13
             x̄    9.8       2.4         9.5       3.3
             s    6.9       2.8         7.5       3.1
Session 4    n    21        21          10        13
             x̄    5.8       2.9         6.0       2.8
             s    4.9       3.8         5.9       4.2
Session 5    n    21        21          10        13
             x̄    5.9       4.5         4.5       3.0
             s    4.6       4.3         4.8       3.4
Session 6    n    21        21          10        13
             x̄    5.7       3.7         5.3       3.2
             s    4.3       5.0         5.3       3.3
Session 7    n    19        20          10        13
             x̄    5.7       2.7         4.6       1.9
             s    4.7       3.5         4.6       3.2
Session 8    n    18        18          10        13
             x̄    7.9       3.7         8.0       2.7
             s    6.6       5.2         7.0       4.9
Session 9    n    19        19          10        13
             x̄    6.2       2.4         4.9       1.7
             s    4.3       3.4         4.6       2.6
Session 10   n    18        18          10        13
             x̄    6.2       2.6         4.4       2.2
             s    5.2       3.1         4.8       2.7
 
 
 
Chart 7.4: Mean insertion error percentage of keyboard and speech-L 
From both the chart and the table it is clear that the speech-L interaction technique had a higher insertion 
error percentage than did the keyboard. There was, however, some improvement over the sessions, apart 
from session 8 where there was a sharp increase in the error rate. The following hypotheses were formulated: 
1. H0,1: The interaction technique has no effect on the insertion errors percentage. 
2. H0,2: The session has no effect on the insertion errors percentage. 
 
The assumption of sphericity was violated (χ²(35) = 61.167, p < 0.05), therefore adjusted corrections were 
applied to the degrees of freedom. Table 7.4 contains the results of all the required analysis. 
Table 7.4: Analysis results for insertion error percentage of keyboard and speech-L
                                  ANOVA                          Geisser-Greenhouse                 Huynh-Feldt                        Multivariate
Interaction technique             F(1, 21) = 6.516, p < 0.05
Session                           F(8, 168) = 2.278, p < 0.05    F(5.1, 107.4) = 2.278, p < 0.05    F(7.3, 152.9) = 2.278, p < 0.05    F(8, 14) = 1.687, p > 0.05
Interaction technique × Session   F(8, 168) = 1.236, p > 0.05    F(5.1, 107.4) = 1.236, p > 0.05    F(7.3, 152.9) = 1.236, p > 0.05    F(8, 14) = 1.646, p > 0.05
 
At a 95% confidence level, both H0,1 and H0,2 could be rejected. Therefore, there is a significant 
difference between the percentage of insertion errors made when using the different interaction techniques. 
Since the interaction technique of eye gaze and speech has, on average, more insertion errors, it could be 
concluded that the prior supposition was indeed correct.  
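
The fractional degrees of freedom reported for the adjusted corrections in Table 7.4 arise from multiplying both degrees of freedom of the F-test by a correction factor ε, while the F-value itself is unchanged. The Python sketch below illustrates this arithmetic. The function name is hypothetical, and the ε values are assumptions back-calculated from the reported degrees of freedom; they are not stated in the text.

```python
def adjusted_df(df_effect, df_error, epsilon):
    """Scale both degrees of freedom by the sphericity correction factor epsilon."""
    return round(epsilon * df_effect, 1), round(epsilon * df_error, 1)

# Unadjusted session effect from Table 7.4: F(8, 168).
# Assumed epsilon values, chosen so that the result matches the reported table.
geisser_greenhouse = adjusted_df(8, 168, 0.639)
huynh_feldt = adjusted_df(8, 168, 0.91)
```

The smaller ε is, the more the degrees of freedom shrink and the harder it becomes for the same F-value to reach significance, which is why the Geisser-Greenhouse row is the more conservative of the two.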
Tukey’s post-hoc test did not indicate significant differences between any sessions, but the less conservative 
Fisher’s LSD test did. There was a significant difference between session 2 and sessions 5, 7, 9 and 10. Session 
3 also differed significantly from sessions 7, 9 and 10. In particular, it was session 3 of the speech-L which was 
significantly higher than the majority of the other sessions with either interaction technique. Session 3 with 
the speech-L had a very high percentage of insertion errors, therefore during that session, the participants 
made significantly more insertion errors than in any other session with either interaction technique. 
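
Why an unadjusted pairwise procedure such as Fisher's LSD is less conservative than Tukey's test can be illustrated by the familywise error rate. The sketch below counts the pairwise comparisons among the nine sessions (sessions 2 to 10) and estimates the probability of at least one false positive if each comparison is tested at α = 0.05 without adjustment. It assumes independent comparisons, which pairwise tests on shared data are not, so the figure is indicative only; the function name is hypothetical.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Return the number of pairwise comparisons and the chance of >= 1 false positive."""
    n_comparisons = comb(n_groups, 2)
    return n_comparisons, 1 - (1 - alpha) ** n_comparisons

# Nine sessions give 36 pairwise comparisons.
n_comparisons, fwer = familywise_error_rate(9)
```

Tukey's HSD holds this familywise rate at α across all comparisons, so fewer pairs reach significance than under the unadjusted approach.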
 
7.5.1.2.2 Substitution error percentage 
The same procedure as in the previous section was used to determine the ratio of substitution errors. 
Descriptive statistics are tabulated below. 
Using Chart 7.5 and Table 7.5 as a reference it is clear that for the first seven sessions, the eye gaze and speech 
averages a much higher substitution percentage than the keyboard. It was only during the final two sessions 
that the number of substitutions for the interaction techniques reached levels that are possibly comparable to 
one another. 
The following hypotheses were formulated: 
1. H0,1: The interaction technique has no effect on the percentage of substitution errors made. 
2. H0,2: There is no difference between the percentages of substitution errors made between the 
sessions. 
  
Table 7.5: Descriptive statistics for substitution error percentage of keyboard and speech-L
                  All participants      Participants completing all sessions
                  Speech    Keyboard    Speech    Keyboard
Session 2    n    17        21          8         13
             x̄    5.3       0.8         5.2       0.3
             s    5.4       1.3         4.6       0.7
Session 3    n    19        20          8         13
             x̄    4.3       0.9         3.4       0.9
             s    4.6       2.3         4.6       2.4
Session 4    n    19        21          8         13
             x̄    2.9       0.5         3.1       0.4
             s    3.4       0.9         2.9       0.9
Session 5    n    21        21          8         13
             x̄    6.5       0.9         2.4       0.6
             s    7.5       1.8         3.0       1.5
Session 6    n    20        21          8         13
             x̄    3.3       0.5         1.2       0.5
             s    4.9       1.0         2.4       0.9
Session 7    n    20        20          8         13
             x̄    4.9       0.8         3.8       1.0
             s    5.2       1.8         4.6       2.1
Session 8    n    18        18          8         13
             x̄    2.2       0.3         1.1       0
             s    2.5       0.8         1.7       0
Session 9    n    18        19          8         13
             x̄    1.72      0.51        1.7       0.5
             s    2.51      1.28        1.9       1.2
Session 10   n    18        21          8         13
             x̄    2.41      0.86        0.9       1.1
             s    3.42      1.73        1.3       2.0
 
When analysing the null hypotheses, it was found that there was a significant interaction (F(8, 152) = 2.205, p < 
0.05) between the two factors of interaction technique and session. Therefore, each session was analysed 
individually to determine whether there was a significant difference between the percentage of substitutions 
for each interaction technique during that session. It was found that H0,1 could be rejected for all sessions 
other than the last two. Therefore, for sessions 2 to 8 the use of the speech-L interaction technique resulted in 
participants making significantly more substitution errors.  
Since it was only the interaction technique of eye gaze and speech that was of interest, the second null 
hypothesis of no difference was only applied to the speech-L interaction technique. The percentage of 
substitution errors violated the assumption of sphericity (χ²(35) = 54.808, p < 0.05), thus Table 7.6 reports all 
required analysis that was performed on the data. 
The second null hypothesis could not be rejected, therefore the percentage of substitution errors does not 
improve significantly as the use of the eye gaze and speech for text input increases. 
 
 
Chart 7.5: Mean substitution error percentage of keyboard and speech-L 
 
 
Table 7.6: Results for the analysis of session for speech-L substitution errors percentage
           ANOVA                         Geisser-Greenhouse               Huynh-Feldt
Session    F(8, 56) = 1.623, p > 0.05    F(2.8, 19.3) = 1.623, p > 0.05   F(4.7, 33.0) = 1.623, p > 0.05
 
 
7.5.1.2.3 Deletion error percentage 
The final category of error percentages was the deletion percentage. The percentage of possible deletion 
errors was calculated for all participants for each session’s typing tasks using both interaction techniques. 
Descriptive statistics for the data are summarised in Table 7.7. 
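
The two column groups used in these descriptive tables (all participants versus participants completing all sessions) can be sketched as follows. The data values, participant labels and function names are hypothetical, and the use of the sample standard deviation for s is an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev

def describe(values):
    """Return n, mean and sample standard deviation, rounded to one decimal."""
    return {"n": len(values), "mean": round(mean(values), 1), "s": round(stdev(values), 1)}

def completers(data, sessions):
    """Participants who have an observation in every listed session."""
    return sorted(p for p, obs in data.items() if all(s in obs for s in sessions))

# Hypothetical deletion percentages: participant -> {session: value}.
data = {
    "P1": {2: 3.0, 3: 1.0},
    "P2": {2: 5.0},            # missed session 3
    "P3": {2: 4.0, 3: 3.0},
}
sessions = [2, 3]

all_session_2 = describe([obs[2] for obs in data.values() if 2 in obs])
full_set = completers(data, sessions)
completers_session_2 = describe([data[p][2] for p in full_set])
```

Restricting the sample to participants with complete data is what a repeated-measures analysis requires, which would explain why n stays constant across sessions in the right-hand column groups.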
Chart 7.6 and Table 7.7 indicate that the deletion percentages for the keyboard do not follow a generalised 
trend, but fluctuate erratically between sessions. The eye gaze and speech interaction technique is somewhat 
more stable across the sessions. From Chart 7.6 it appears as though the percentage of deletion errors is 
comparable for the keyboard and speech-L. 
The following hypotheses were formulated: 
1. H0,1: The interaction technique has no effect on the percentage of deletion errors made. 
2. H0,2: The session has no effect on the percentage of deletion errors made. 
Table 7.8 contains the results of the repeated-measures within-subjects ANOVA, the multivariate tests and the 
adjusted corrections since the assumption of sphericity was not met (χ²(35) = 75.912, p < 0.05). 
 
 
Table 7.7: Descriptive statistics for the deletion error percentage of keyboard and speech-L
                  All participants      Participants completing all sessions
                  Speech    Keyboard    Speech    Keyboard
Session 2    n    17        21          8         13
             x̄    3.5       0.5         3.2       0
             s    4.1       1.5         3.6       0
Session 3    n    20        20          8         13
             x̄    2.5       1.9         1.5       2.2
             s    3.7       3.9         2.5       4.7
Session 4    n    19        21          8         13
             x̄    2.8       0.7         1.5       0.9
             s    4.0       1.6         2.8       2.0
Session 5    n    21        21          8         13
             x̄    1.8       1.2         1.9       1.1
             s    2.7       2.9         3.0       3.4
Session 6    n    20        21          8         13
             x̄    2.4       0.8         2.0       0.4
             s    4.3       1.7         4.1       0.7
Session 7    n    18        20          8         13
             x̄    1.3       1.5         1.8       1.9
             s    2.9       3.2         4.1       3.9
Session 8    n    18        18          8         13
             x̄    2.8       1.8         2.8       1.3
             s    4.7       2.5         4.5       2.2
Session 9    n    18        18          8         13
             x̄    1.0       0.5         0.7       0.4
             s    1.5       1.1         1.0       1.0
Session 10   n    17        21          8         13
             x̄    1.0       0.5         0.9       0.5
             s    2.0       0.8         2.1       0.8
 
 
 
Chart 7.6: Mean deletion errors percentage of keyboard and speech-L 
Table 7.8: Analysis results for deletion error percentage of keyboard and speech-L
                                  ANOVA                          Geisser-Greenhouse                Huynh-Feldt                        Multivariate
Interaction technique             F(1, 19) = 1.760, p > 0.05
Session                           F(8, 152) = 0.809, p > 0.05    F(4.2, 79.1) = 0.809, p > 0.05    F(5.8, 109.3) = 0.809, p > 0.05    F(8, 12) = 1.437, p > 0.05
Interaction technique × Session   F(8, 152) = 0.937, p > 0.05    F(4.2, 79.1) = 0.937, p > 0.05    F(5.8, 109.3) = 0.937, p > 0.05    F(8, 12) = 1.951, p > 0.05
 
Neither of the null hypotheses could be rejected at an α-level of 0.05, which indicates that neither the 
interaction technique nor the session has a significant effect on the percentage of deletions which occur 
when transcribing text. 
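
For readers unfamiliar with the test, the F-ratio of a repeated-measures ANOVA can be sketched in Python for the simplified case of a single within-subjects factor (session). This is an illustration under stated assumptions rather than the analysis used in the study, which had two factors (interaction technique and session) and was presumably run in a statistics package. The function name and the data are hypothetical.

```python
def rm_anova_f(scores):
    """One-way repeated-measures ANOVA.

    scores: one row per participant, one column per session.
    Returns (F, df_session, df_error).
    """
    n = len(scores)          # participants
    k = len(scores[0])       # sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    session_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subject_means = [sum(row) / k for row in scores]
    ss_session = n * sum((m - grand) ** 2 for m in session_means)
    ss_subject = k * sum((m - grand) ** 2 for m in subject_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    # Variation between participants is removed from the error term.
    ss_error = ss_total - ss_session - ss_subject
    df_session = k - 1
    df_error = (k - 1) * (n - 1)
    f_ratio = (ss_session / df_session) / (ss_error / df_error)
    return f_ratio, df_session, df_error

# Hypothetical error percentages: three participants over three sessions.
f_ratio, df_session, df_error = rm_anova_f([[1, 2, 4], [2, 3, 4], [3, 4, 7]])
```

With nine sessions and twenty participants these formulas give degrees of freedom of 8 and 152, which is consistent with the session row of Table 7.8.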
 
7.5.1.3 Characters per second 
Characters per second (CPS) for all sessions, each participant and each interaction technique were calculated. 
The number of observations, the mean and standard deviation of these observations are tabulated below. 
Table 7.9: Descriptive statistics for characters per second of keyboard and speech-L
                  All participants      Participants completing all sessions
                  Speech    Keyboard    Speech    Keyboard
Session 2    n    18        21          10        13
             x̄    0.2       2.2         0.2       2.5
             s    0.1       1.1         0.1       1.2
Session 3    n    20        20          10        13
             x̄    0.2       2.4         0.1       2.5
             s    0.1       1.0         0.1       1.2
Session 4    n    21        21          10        13
             x̄    0.2       2.5         0.2       2.7
             s    0.1       1.0         0.1       1.2
Session 5    n    21        21          10        13
             x̄    0.2       2.6         0.2       2.8
             s    0.1       1.2         0.1       1.4
Session 6    n    20        21          10        13
             x̄    0.2       2.4         0.2       2.7
             s    0.1       1.0         0.1       1.1
Session 7    n    20        20          10        13
             x̄    0.2       2.6         0.2       2.9
             s    0.1       1.1         0.1       1.2
Session 8    n    18        18          10        13
             x̄    0.3       2.6         0.3       2.7
             s    0.1       1.0         0.1       1.0
Session 9    n    19        19          10        13
             x̄    0.3       2.6         0.3       2.8
             s    0.1       0.9         0.1       1.0
Session 10   n    21        21          10        13
             x̄    0.2       2.7         2.9       2.9
             s    0.1       0.9         0.1       1.0
 
The chart below is a plot of the mean characters per second for each interaction technique over all sessions. 
 
Chart 7.7: Mean characters per second of keyboard and speech-L 
 
From the graph and the table it can be seen that when typing with the keyboard, participants were able to 
type at a faster rate than when using eye gaze and speech. The speed with which typing can be achieved using 
eye gaze and speech remains fairly constant throughout the sessions, displaying only mild improvement as the 
exposure increases. The following hypotheses were formulated: 
1. H0,1: The number of characters per second that can be typed is not influenced by the interaction 
technique. 
2. H0,2: The number of characters per second that can be typed is not influenced by the session in which 
the task was completed. 
The assumption of sphericity was not met (χ²(35) = 136.334, p < 0.05), therefore Table 7.10 contains both the 
results of the repeated-measures ANOVA as well as the adjusted corrections that were required. To complete 
the analysis, the results of the multivariate tests are also reported. 
Table 7.10: Analysis results for characters per second of keyboard and speech-L
                                  ANOVA                          Geisser-Greenhouse                Huynh-Feldt                       Multivariate
Interaction technique             F(1, 21) = 54.704, p < 0.05
Session                           F(8, 168) = 1.385, p > 0.05    F(3.6, 75.5) = 1.385, p > 0.05    F(4.6, 97.3) = 1.385, p > 0.05    F(8, 14) = 5.866, p > 0.05
Interaction technique × Session   F(8, 168) = 0.660, p > 0.05    F(3.6, 75.5) = 0.660, p > 0.05    F(4.6, 97.3) = 0.660, p > 0.05    F(8, 14) = 3.105, p > 0.05
 
The results in the table above show that H0,1 could be rejected at an α-level of 0.05. Therefore, typing speed, 
as measured by characters typed per second, is significantly slower when using eye gaze and speech than when using a keyboard. The fact that 
H0,2 could not be rejected shows that, although there is slight improvement between sessions, the 
improvement is not significant. 
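
For reference, the characters-per-second measure analysed above can be sketched as follows. The exact definition used in the study (for example, whether timing starts at the first character) is not restated in this section, so the simple ratio below is an assumption, as are the function names and the sample data.

```python
def characters_per_second(transcribed, seconds):
    """Typing speed as the number of transcribed characters divided by elapsed time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return len(transcribed) / seconds

def mean_cps(tasks):
    """Average typing speed over a session's (transcribed, seconds) pairs."""
    rates = [characters_per_second(text, seconds) for text, seconds in tasks]
    return sum(rates) / len(rates)

# Hypothetical session with two typing tasks.
session_tasks = [("hello world", 44.0), ("good morning all", 80.0)]
session_mean = mean_cps(session_tasks)
```

Averaging the per-task rates, as done here, gives each task equal weight regardless of its length.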
 
7.5.2 Analysis of all typing tasks 
The previous section analysed the typing tasks which were completed using the keyboard and the original 
sized onscreen keyboard. As previously mentioned, there were four additional typing tasks added from the 
fifth session onwards. Two of these were performed with buttons which were smaller and more closely spaced and 
the other two also with smaller buttons but spaced further apart. These two interaction techniques will be 
referred to as Speech-SC and Speech-SW respectively. The original sized buttons will be referred to as the 
Speech-L interaction technique for the duration of the analysis. The same measurements as with the previous 
analysis will be analysed, namely error rate and characters per second. This will allow both the effectiveness 
and efficiency to be analysed. 
 
7.5.2.1 Error Rate 
The average error rate of each participant was calculated for each interaction technique and each session. The 
number of observations for each session, the mean and the standard deviation are tabulated below. 
Table 7.11: Descriptive statistics for error rates of all interaction techniques
                  All participants                            Participants completing all sessions
                  Speech-L   Keyboard   Speech-SW  Speech-SC  Speech-L   Keyboard   Speech-SW  Speech-SC
Session 5    n    21         21         19         17         11         15         12         9
             x̄    14.2       6.5        15.1       15.3       13.2       5.7        15.5       14.7
             s    9.0        6.7        6.3        5.7        9.5        6.4        6.9        7.0
Session 6    n    20         21         18         16         11         15         12         9
             x̄    12.8       5.0        16.3       17.1       11.5       5.3        17.7       16.7
             s    7.2        5.0        9.6        10.0       7.1        5.6        10.9       11.3
Session 7    n    18         20         19         17         11         15         12         9
             x̄    12.2       4.9        15.6       13.5       10.3       5.7        13.3       12.3
             s    9.2        7.1        8.4        6.7        8.3        8.1        8.4        8.9
Session 8    n    17         18         18         17         11         15         12         9
             x̄    11.9       5.8        13.1       15.1       13.3       4.8        11.2       14.5
             s    6.5        5.2        8.6        9.9        5.8        4.9        8.6        11.9
Session 9    n    18         19         18         17         11         15         12         9
             x̄    9.3        3.4        12.5       15.5       7.7        3.0        12.9       17.8
             s    6.5        4.7        8.4        10.6       6.7        4.1        8.2        11.9
Session 10   n    17         21         21         20         11         15         12         9
             x̄    9.3        3.9        14.2       13.4       8.9        4.4        10.3       12.7
             s    5.9        3.8        11.9       8.4        6.5        4.3        8.5        9.3
 
Chart 7.8 is a plot of the means for the interaction techniques over all sessions. 
From Table 7.11 it can be determined that the keyboard had the lowest error rate of all interaction techniques 
for all sessions. Thereafter, speech-L had the next lowest error rate while the smaller buttons, both speech-SW 
and speech-SC, caused the highest error rates for all sessions. The latter two seem to cause approximately the 
same error rates while typing, however, the widely spaced buttons have an improved error rate during the 
later sessions while the error rates for the closely spaced buttons increased over the same period. 
The following hypotheses were formulated: 
1. H0,1: The error rate while typing is not affected by the interaction technique. 
2. H0,2: There is no difference between the sessions in the error rate while typing. 
 
 
Chart 7.8: Mean error rate for all interaction techniques 
 
The data met the condition of sphericity (χ²(14) = 17.521, p > 0.05) which meant that no adjusted corrections 
had to be applied to the degrees of freedom. The required results for the analysis are shown in Table 7.12. 
Table 7.12: Analysis results of error rates for all interaction techniques
                                  ANOVA                           Multivariate
Interaction technique             F(3, 43) = 7.303, p < 0.05
Session                           F(5, 215) = 2.530, p < 0.05     F(5, 39) = 2.599, p < 0.05
Interaction technique × Session   F(15, 215) = 1.212, p > 0.05    F(15, 108) = 1.544, p > 0.05
 
Consequently, both null hypotheses could be rejected using an α-level of 0.05. Therefore, the interaction 
technique has a noticeable impact on the error rate while typing and the error rate differs significantly 
between sessions. 
Tukey’s post-hoc test was used to determine which interaction techniques were responsible for the significant 
difference. It was found that the keyboard differed significantly from both speech-SW and speech-SC. 
Therefore, when using either speech-SW or speech-SC participants had a significantly higher typing error rate 
than when using a keyboard. In this instance, it is encouraging to determine that speech-L does not differ 
significantly from the keyboard in these later sessions. This would seem to indicate that after some practice 
with the larger buttons, the number of errors made decreases. The same cannot be said of the smaller 
buttons. 
Session 6 differed significantly from session 10; in particular the error rates for speech-SW during session 6 
were significantly higher than the error rates for the keyboard for all sessions. Furthermore, session 9 of 
Speech-SC differed significantly from all sessions with the keyboard. 
The number of participants who were able to transcribe the text error-free was determined next. The results 
are shown in Chart 7.9. 
[Bar graph: number of participants (y-axis) who produced an error-free transcribed text in each of Sessions 5 to 10 (x-axis), for Speech-L, Keyboard, Speech-SW and Speech-SC]
 
Chart 7.9: Error-free transcribed text for all interaction techniques 
The keyboard clearly outperformed the other interaction techniques in this regard. For each session there 
were at least two participants who transcribed the text error-free when using the keyboard. In contrast, the 
speech interaction techniques had either zero or only one error-free transcribed text string in each session. It 
was only in the final session that speech-SC had a higher number of error-free transcribed text strings, but 
even then it was only two participants who could manage that feat. 
 
7.5.2.2 Breakdown of error rate 
Each of the error rates could be further subdivided into the percentage of insertion errors, the percentage of 
substitution errors and the percentage of deletion errors. The graph below is a stacked bar graph for the first 
task (first four stacks) and the last task (last four stacks). 
[Stacked bar graph: error rate (%) split into insertion, substitution and deletion errors for Speech-L, Keyboard, Speech-SW and Speech-SC, first task and last task]
Chart 7.10: Breakdown of first task and last task’s error rate for all interaction techniques  
The percentage of insertion errors was the highest for all interaction techniques for both of these sessions. The 
interaction techniques of speech-SW and speech-SC have very similar distributions over the number of 
insertions, substitutions and deletions. In order to determine the significance of these distributions, each was 
analysed individually. The first of these was the percentage of insertion errors which will be discussed in the 
following section. 
 
7.5.2.2.1 Percentage of insertion errors 
Descriptive statistics for the insertion errors percentage of all the interaction techniques and for all sessions 
are summarised in Table 7.13. 
Table 7.13: Descriptive statistics for insertion errors percentage of all interaction techniques
                  All participants                            Participants completing all sessions
                  Speech-L   Keyboard   Speech-SW  Speech-SC  Speech-L   Keyboard   Speech-SW  Speech-SC
Session 5    n    21         21         19         17         14         15         11         8
             x̄    5.9        4.4        9.9        10.1       6.1        3.6        8.7        7.7
             s    4.6        4.3        5.9        5.6        5.1        3.9        6.1        4.8
Session 6    n    21         21         18         16         14         15         11         8
             x̄    5.7        3.7        10.6       13.2       5.0        4.2        11.4       9.1
             s    4.3        5.0        9.2        10.0       4.9        5.7        11.4       9.5
Session 7    n    19         20         19         17         14         15         11         8
             x̄    5.7        2.7        11.4       9.3        6.2        2.7        7.9        5.5
             s    4.7        3.5        7.7        6.9        5.2        4.0        5.7        5.8
Session 8    n    18         18         18         17         14         15         11         8
             x̄    7.9        3.7        9.3        10.8       7.6        2.6        5.5        9.7
             s    6.6        5.2        6.6        8.1        6.4        4.6        3.4        11.0
Session 9    n    19         19         18         17         14         15         11         8
             x̄    6.2        2.4        10.2       9.7        5.7        2.3        8.1        7.2
             s    4.3        3.4        8.0        9.1        4.4        3.6        4.5        6.2
Session 10   n    18         21         21         20         14         15         11         8
             x̄    6.2        2.6        8.9        9.4        6.4        3.0        6.9        6.3
             s    5.2        3.1        6.3        7.9        5.9        3.5        7.0        5.0
 
Chart 7.11 is a plot of the mean percentage of insertion errors for all typing tasks. 
 
Chart 7.11: Mean insertion errors percentage for all interaction techniques 
The keyboard had the lowest percentage of insertion errors, followed by speech-L. Once again, the interaction 
techniques of speech-SW and speech-SC were, for the most part, barely distinguishable from each other. The 
following hypotheses were formulated: 
1. H0,1: The interaction technique has no effect on the percentage of insertion errors made. 
2. H0,2: There is no difference between the percentage of insertion errors made between the sessions. 
 
The assumption of sphericity was violated (χ²(14) = 35.538, p < 0.05), therefore Table 7.14 contains the results 
of the adjusted corrections as well as the other required analyses. 
Table 7.14: Analysis results for insertion errors percentage of all interaction techniques
                                  ANOVA                           Geisser-Greenhouse                 Huynh-Feldt                        Multivariate
Interaction technique             F(3, 44) = 4.100, p < 0.05
Session                           F(5, 220) = 1.056, p > 0.05     F(3.9, 169.9) = 1.056, p > 0.05    F(4.6, 200.9) = 1.056, p > 0.05    F(5, 40) = 5.866, p > 0.05
Interaction technique × Session   F(15, 220) = 1.002, p > 0.05    F(3.9, 169.9) = 1.002, p > 0.05    F(4.6, 200.9) = 1.002, p > 0.05    F(15, 111) = 0.811, p > 0.05
 
The first null hypothesis could be rejected as the p-value was less than the α-level used. Tukey’s HSD post-hoc 
test indicated that the speech-SW interaction technique resulted in a significantly higher percentage of 
insertion errors than did the keyboard. There was no other significant difference between any other 
interaction techniques. The second null hypothesis could not be rejected; therefore there was no significant 
change in the percentage of insertion errors as the amount of exposure to the application increased. 
 
7.5.2.2.2 Percentage of substitution errors 
The number of included observations, the mean and the standard deviation of the percentage of substitution 
errors are tabulated below. 
Chart 7.12 is a plot of the mean percentage of substitution errors for all interaction techniques and all 
sessions. 
The percentage of substitution errors was lowest for the keyboard over all sessions. Of the interaction 
techniques which incorporated speech and eye gaze, speech-SW had the lowest mean percentage of 
substitution errors for the majority of the sessions. Apart from session 9, the percentage of substitution errors 
for speech-SC steadily improved as time went by. The following hypotheses were formulated to determine 
whether the differences noticed on Chart 7.12 and Table 7.15 were significant. 
1. H0,1: The interaction technique has no effect on the percentage of substitution errors made.  
2. H0,2: There is no difference between the percentages of substitution errors made between the 
sessions. 
 
When performing a repeated-measures within-subjects ANOVA it was found that there was significant 
interaction (F(15, 210) = 2.228, p < 0.05) between the factors of session and interaction technique. 
Subsequently, separate ANOVAs had to be performed in order that the influencing factors could be controlled 
for. 
 
Table 7.15: Descriptive statistics for substitution errors percentage of all interaction techniques
                  All participants                            Participants completing all sessions
                  Speech-L   Keyboard   Speech-SW  Speech-SC  Speech-L   Keyboard   Speech-SW  Speech-SC
Session 5    n    20         21         19         17         12         15         11         8
             x̄    5.5        0.9        3.7        3.9        3.3        0.7        3.4        3.4
             s    6.1        1.8        4.9        4.3        3.4        1.6        4.8        4.3
Session 6    n    20         21         18         16         12         15         11         8
             x̄    3.3        0.5        3.9        3.3        1.0        0.6        3.4        3.0
             s    4.9        1.0        3.5        5.7        2.0        1.0        2.6        5.3
Session 7    n    20         20         19         17         12         15         11         8
             x̄    4.9        0.8        1.9        2.6        4.6        1.0        1.2        2.3
             s    5.2        1.8        2.8        3.3        5.0        2.0        1.5        2.3
Session 8    n    18         18         18         16         12         15         11         8
             x̄    2.2        0.3        1.3        1.4        1.6        0.3        0.8        0.5
             s    2.5        0.8        2.0        2.3        1.6        0.8        1.1        0.9
Session 9    n    18         19         18         17         12         15         11         8
             x̄    1.7        0.5        1.5        4.1        1.1        0.4        1.1        4.3
             s    2.5        1.3        2.9        5.7        1.7        1.2        1.7        5.7
Session 10   n    18         21         20         19         12         15         11         8
             x̄    2.4        0.9        2.0        2.1        1.6        1.0        0.8        1.2
             s    3.4        1.7        3.1        3.0        3.5        1.9        1.3        1.9
 
 
 
Chart 7.12: Mean substitution errors percentage of all interaction techniques 
 
H0,1 could be rejected for session 5 (F(3, 73) = 3.690, p < 0.05), where the keyboard had a significantly lower 
percentage of substitution errors than speech-L. Similarly, H0,1 could be rejected for session 6 (F(3, 71) = 2.862, 
p < 0.05) as the keyboard had a significantly lower percentage of substitution errors than all other interaction 
techniques. During session 7 (F(3, 72) = 5.040, p < 0.05) when participants used the speech-L interaction 
technique they had a significantly higher percentage of substitution errors than when using the speech-SW 
interaction technique and the keyboard. The keyboard also resulted in lower percentages of substitution 
errors than the speech-SC interaction technique during session 9 (F(3, 68) = 3.442, p < 0.05). The null 
hypothesis could not be rejected for either session 8 (F(3, 66) = 2.671, p > 0.05) or 10 (F(3, 74) = 1.113, p > 
0.05) at an α-level of 0.05. 
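
The per-session tests reported above have degrees of freedom that are consistent with a one-way ANOVA over the four interaction techniques; for session 5, for example, the 77 observations listed in Table 7.15 give F(3, 73). Assuming that this was the form of test used, a minimal Python sketch of the F-ratio is given below. The function name and the data are hypothetical.

```python
def one_way_anova_f(groups):
    """One-way ANOVA across independent groups.

    groups: one list of observations per interaction technique.
    Returns (F, df_between, df_within).
    """
    values = [x for group in groups for x in group]
    grand = sum(values) / len(values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical substitution percentages for three techniques in one session.
f_ratio, df_between, df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

Analysing each session separately in this way sidesteps the interaction between technique and session, at the cost of running several tests instead of one.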
 
The second null hypothesis could be rejected at an α-level of 0.05 for speech-L and speech-SW. The results of all 
the analyses to investigate H0,2 are tabulated below. 
 
Table 7.16: Analysis results of substitution errors percentage for all interaction techniques
             Mauchly's                     ANOVA                         Geisser-Greenhouse               Huynh-Feldt
Speech-L     χ²(14) = 18.507, p > 0.05     F(5, 55) = 2.824, p < 0.05
Keyboard     χ²(14) = 33.646, p < 0.05     F(5, 70) = 0.738, p > 0.05    F(2.7, 37.7) = 0.738, p > 0.05   F(3.4, 47.5) = 0.738, p > 0.05
Speech-SC    χ²(14) = 31.840, p < 0.05     F(5, 35) = 1.400, p > 0.05    F(2.7, 18.6) = 1.400, p > 0.05   F(4.4, 31.1) = 1.400, p > 0.05
Speech-SW    χ²(14) = 22.982, p > 0.05     F(5, 50) = 3.285, p < 0.05
 
From the table it can be concluded that H0,2 could be rejected for speech-L and speech-SW. For the large 
buttons, the percentage of substitution errors was, on average, lower for session 6 than for session 7. When 
using eye gaze and speech with the smaller buttons which were widely spaced, session 5 and session 6 differed 
significantly from sessions 7 to 10. The latter sessions had a lower average than the first two sessions which 
indicated that when using this interaction technique there was some measure of learning as the exposure to 
the application increased. 
 
 
7.5.2.2.3 Deletion errors percentage 
Table 7.17 contains a summary of the number of observations, mean and the standard deviation of the 
deletion errors percentage.  
Table 7.17: Descriptive statistics of deletion errors percentage for all interaction techniques
                  All participants                            Participants completing all sessions
                  Speech-L   Keyboard   Speech-SW  Speech-SC  Speech-L   Keyboard   Speech-SW  Speech-SC
Session 5    n    21         20         19         17         9          14         11         9
             x̄    1.8        0.6        1.6        1.4        2.4        0.6        2.1        1.9
             s    2.7        1.6        2.4        2.1        3.2        1.5        2.5        2.1
Session 6    n    17         21         17         16         9          14         11         9
             x̄    0.7        0.8        1.3        0.6        0.2        0.5        1.0        0.5
             s    1.3        1.7        2.1        2.0        0.6        1.1        1.9        0.9
Session 7    n    17         19         18         16         9          14         11         9
             x̄    0.7        0.8        1.5        1.1        0.5        1.1        2.2        0.9
             s    1.4        1.7        2.3        2.2        1.0        1.9        2.5        1.4
Session 8    n    18         18         18         17         9          14         11         9
             x̄    2.8        1.8        2.5        2.2        2.3        1.8        3.5        1.4
             s    4.7        2.5        4.6        2.9        4.4        2.6        5.6        2.8
Session 9    n    18         19         18         17         9          14         11         9
             x̄    1.0        0.5        0.9        1.7        0.2        0.4        0.8        1.9
             s    1.5        1.1        1.6        2.6        0.5        1.0        1.5        3.1
Session 10   n    17         21         20         20         9          14         11         9
             x̄    1.0        0.5        1.2        1.2        0.8        0.5        0.8        1.2
             s    2.9        0.8        2.4        2.2        2.0        0.7        2.5        2.0
The chart below is a plot of the mean for all interaction techniques across all sessions. 
 
Chart 7.13: Mean deletion errors percentage for all interaction techniques 
 
Inspection of Table 7.17 and Chart 7.13 shows that the number of deletions is approximately the same for all 
interaction techniques and across all sessions. Apart from session 8, which shows a sharp spike in the number 
of deletions, the deletion errors percentage remains fairly stable throughout. The following hypotheses were 
formulated: 
1. H0,1: The interaction technique has no effect on the percentage of deletion errors made. 
2. H0,2: There is no difference between the percentages of deletion errors made between the sessions. 
The assumption of sphericity, which is required for a repeated-measures, within-subjects ANOVA, was not met 
(χ²(14) = 67.342, p < 0.05). This required additional analyses to be performed on the data. The results of all the 
tests are tabulated below. 
Table 7.18: Analysis results of deletion errors percentage for all sessions
                        ANOVA                  Geisser-Greenhouse       Huynh-Feldt              Multivariate
Interaction technique   F(3, 39) = 1.638,
                        p > 0.05
Session                 F(5, 195) = 3.450,     F(2.6, 103.2) = 3.450,   F(3.1, 120.0) = 3.450,   F(5, 15) = 4.221,
                        p < 0.05               p < 0.05                 p < 0.05                 p < 0.05
Interaction technique   F(15, 195) = 0.766,    F(7.9, 103.2) = 0.766,   F(9.2, 120.0) = 0.766,   F(15, 97) = 0.491,
× Session               p > 0.05               p > 0.05                 p > 0.05                 p > 0.05
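
The corrected columns of Table 7.18 can be read as the uncorrected ANOVA degrees of freedom scaled by a sphericity estimate ε. The sketch below is illustrative only: the ε values are not reported in the text and are back-calculated here from the tabulated error degrees of freedom (103.2/195 and 120.0/195), and the function name is hypothetical.

```python
def corrected_df(df_effect, df_error, epsilon):
    """Scale both ANOVA degrees of freedom by a sphericity estimate epsilon."""
    return df_effect * epsilon, df_error * epsilon


# Uncorrected degrees of freedom for the Session effect in Table 7.18.
DF_EFFECT, DF_ERROR = 5, 195

# Assumption: epsilon is recovered from the tabulated error df, not from raw data.
eps_gg = 103.2 / DF_ERROR  # Geisser-Greenhouse estimate (about 0.53)
eps_hf = 120.0 / DF_ERROR  # Huynh-Feldt estimate (about 0.62)

gg = corrected_df(DF_EFFECT, DF_ERROR, eps_gg)
hf = corrected_df(DF_EFFECT, DF_ERROR, eps_hf)

print([round(v, 1) for v in gg])  # degrees of freedom after the first correction
print([round(v, 1) for v in hf])  # degrees of freedom after the second correction
```

Under these assumptions the rounded values agree with the degrees of freedom tabulated for the Session effect. Only the degrees of freedom, and therefore the p-value, change under the corrections; the F statistic itself is unchanged.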
 
At a confidence level of 95%, H0,1 could not be rejected but H0,2 could be rejected. As could be expected, 
Tukey’s HSD post-hoc test indicated that session 8 differed significantly from sessions 6, 9 and 10. Session 8 
had a much higher average deletion percentage than the other sessions. 
 
7.5.2.3 Characters per second 
The typing speed, measured as characters per second, was calculated for each participant and each typing 
task. The average per session was then calculated for each interaction technique. The descriptive statistics for 
this measurement are summarised in the table below. 
 Table 7.19: Descriptive statistics of characters per second for all interaction techniques
                  All participants                             Participants completing all sessions
                  Speech-L  Keyboard  Speech-SW  Speech-SC     Speech-L  Keyboard  Speech-SW  Speech-SC
 Session 5    n   21        19        19         17            14        13        12         9
              x̄   0.2       2.2       0.2        0.2           0.2       2.4       0.2        0.2
              s   0.1       0.7       0.1        0.1           0.0       0.8       0.0        0.1
 Session 6    n   20        21        18         16            14        13        12         9
              x̄   0.2       2.4       0.2        0.3           0.2       2.3       0.3        0.3
              s   0.1       1.0       0.1        0.1           0.1       0.7       0.1        0.1
 Session 7    n   20        19        19         17            14        13        12         9
              x̄   0.2       2.5       0.2        0.3           0.2       2.5       0.3        0.3
              s   0.1       0.7       0.1        0.1           0.1       0.6       0.1        0.0
 Session 8    n   18        18        18         17            14        13        12         9
              x̄   0.3       2.6       0.3        0.3           0.3       2.6       0.3        0.3
              s   0.1       1.0       0.1        0.1           0.1       0.8       0.1        0.1
 Session 9    n   19        19        18         17            14        13        12         9
              x̄   0.3       2.6       0.3        0.3           0.3       2.5       0.3        0.3
              s   0.1       0.9       0.1        0.0           0.1       0.7       0.1        0.0
 Session 10   n   21        21        21         20            14        13        12         9
              x̄   0.2       2.7       0.3        0.3           0.3       2.7       0.3        0.3
              s   0.1       0.9       0.1        0.1           0.1       0.7       0.1        0.1
 
Chart 7.14 shows a plot of the means for the characters per second measurement of all interaction techniques 
across all sessions. 
 
 
Chart 7.14: Mean characters per second for all interaction techniques 
When using the keyboard, participants were clearly able to type at a much faster rate than when using eye 
gaze and speech with the onscreen keyboard. With reference to Table 7.19 and Chart 7.14, it appears that the 
size and spacing of the buttons did not affect the speed at which typing could occur. The following 
hypotheses were formulated to analyse this statistically: 
1. H0,1: There is no difference between the number of characters per second that can be typed using the 
different interaction techniques. 
2. H0,2: The session has no effect on the number of characters per second that can be typed. 
The assumption of sphericity was violated (χ²(14) = 76.146, p < 0.05); therefore, Table 7.20 below shows the 
results of the ANOVA, the multivariate tests and the corrected adjustments required. 
Table 7.20: Analysis results of characters per second for all interaction techniques
                        ANOVA                  Geisser-Greenhouse        Huynh-Feldt               Multivariate
Interaction technique   F(3, 44) = 148.369,
                        p < 0.05
Session                 F(5, 15) = 3.002,      F(3.3, 147.4) = 3.002,    F(3.9, 171.8) = 3.002,    F(5, 40) = 2.563,
                        p < 0.05               p < 0.05                  p < 0.05                  p < 0.05
Interaction technique   F(15, 220) = 0.845,    F(10.0, 147.4) = 0.845,   F(11.7, 171.8) = 0.845,   F(5, 110) = 1.264,
× Session               p > 0.05               p > 0.05                  p > 0.05                  p > 0.05
 
Both null hypotheses could be rejected at an α-level of 0.05. Therefore, the number of characters typed per 
second differs between the interaction techniques and sessions. Tukey’s HSD shows that the keyboard yields a 
significantly faster typing rate than all other interaction techniques. Session 10 also yielded significantly faster 
typing speeds than sessions 5 and 6. 
 
7.5.3 Summary of results 
It was found that the eye gaze and speech interaction technique had a significantly higher error rate than that 
of the keyboard, undoubtedly as a result of a higher number of insertions and substitutions. This may serve as 
confirmation that even when using eye gaze and speech as a text input mechanism, the user is inclined to 
glance away before completing the issuing of the verbal command. This finding corresponds to that of the 
increased number of missed clicks during ISO testing of the interaction techniques. The average insertions are 
generally higher than the substitutions which would seem to indicate that users are aware that they have 
activated the incorrect character and attempt to correct it by inserting the correct character. This is 
encouraging as it indicates that users become familiarised with the system such that they can interpret the 
selection (indicated by audio feedback) and are able to make corrections to text entry. Further research could 
confirm these suppositions by capturing the correction of the text input as well so that it can be analysed to 
determine whether incorrect inputs are reversed/erased before text input is continued. Whether the buttons 
are large, small and widely spaced or small and closely spaced seems to be of little consequence. There was no 
difference between the error rates of these three interaction techniques and they all differed from the 
keyboard at some stage. However, the interaction technique of speech-L did seem to offer the most improved 
error rate as it did not differ from the keyboard when analysed for the later sessions only. In some instances 
there was improvement over the sessions, which indicates some measure of learning when using the 
interaction technique. If the learning effect can be maintained then more practice with the eye gaze and 
speech could eventually lead to an effectiveness measurement which is comparable to that of the keyboard. 
In terms of efficiency, the keyboard also outperformed all the eye gaze and speech interaction techniques with 
significantly higher numbers of characters per second which could be typed. The typing speed of the eye gaze 
and speech also did not improve as exposure increased. This could indicate that either more practice is needed 
to achieve increased speeds or that the typing speed quickly reaches the fastest achievable rate. Neither the 
size of the buttons nor the spacing between buttons affected the efficiency of the eye gaze and speech. 
Therefore, in terms of effectiveness and efficiency, the three eye gaze and speech interaction techniques seem 
fairly interchangeable as they perform on comparable levels to each other. The keyboard is far more effective 
and efficient than any of the eye gaze and speech interaction techniques when used for text input. 
There are no similar studies with which to compare the results found during this study. However, the fact that 
speech outperforms keyboard input for young children (Read et al., 2001) indicates that the learning curve for 
keyboard entry is fairly steep. This could be the same for text entry with eye gaze and speech. Although there 
was no significant improvement in the speed of the text entry, participants clearly became more comfortable 
with the use of the interaction technique. Therefore, extended practice may be required to improve speeds.  
The mean entry rate of eye gaze and speech fell within the range between 0.2 and 0.3 characters per second. 
Considering that the entry rate was relatively low for context switching at 12 WPM (Morimoto & Amir, 2010) 
and 9 WPM for symbol creator (Miniotas et al., 2003), the range in this study was comparable to these 
previous studies. A previous study showed that the use of both visual and auditory feedback increased the 
entry speed to 7.55 WPM which is lower than the speeds achieved in this study. Speech Dasher achieved much 
higher speeds (40 WPM), while using Dasher with eye gaze also resulted in higher speeds (17 WPM). 
Therefore, when comparing the text entry method to studies using only eye gaze without text predictors, 
speech and eye gaze perform slightly better. However, the speeds are still lower than using text prediction 
methods and when using speech as an activator. While these comparisons are promising since they indicate 
that speech and eye gaze could facilitate faster entry speeds than using eye gaze only, they are discussed with 
caution since the text entered in the current study required only a few short phrases to be entered and more 
prolonged use could have an impact on the entry speed.  
 
7.6 Further research 
Further research can be conducted in which the participants receive more practice with using eye 
gaze and speech as a text input mechanism. This will allow more detailed analysis to be performed in order to 
determine whether a much longer period of exposure would serve to increase the effectiveness and efficiency 
of the interaction technique. Furthermore, future studies could incorporate the correction of errors so that the 
character error rate could determine the eventual correctness of the transcribed text in conjunction with the 
transcribed text before corrections were applied. 
Since it was found that neither the size of the buttons nor the spacing between the buttons influenced the 
usability of the interaction technique, further tests can be conducted to determine whether an increase in the 
gravitational well will impact performance. Although the decrease of physical size and increase of gravitational 
well result in a selectable area with the same size as a large button, the perceived accuracy with smaller 
buttons could serve to boost the confidence, and therefore satisfaction, of end-users. 
 
7.7 Summary 
This chapter reported on the results of the use of eye gaze and speech for text input when compared to a 
traditional keyboard. Measurements of efficiency, namely characters typed per second, and effectiveness, 
namely the character error rate, were analysed. It was found that when using eye gaze and speech for text 
input, neither the size of the buttons nor the spacing between the buttons affected the performance of the 
interaction technique. The keyboard far outperforms eye gaze and speech on both of these usability 
measures. Even with extended exposure to the eye gaze and speech interaction 
techniques, the effectiveness and efficiency could not reach levels which were equivalent to those achieved by 
the keyboard. 
The next chapter will discuss the results of the questionnaires designed to elicit the subjective usability 
measurement of satisfaction. Furthermore it will also discuss the subjective feelings of participants towards 
using the combination of eye gaze and speech to simulate a pointing device. 
 
 
  
 
 
CHAPTER 8 
PARTICIPANT SUBJECTIVE SATISFACTION 
 
8.1 Introduction 
The previous two chapters analysed a number of objective usability measurements. These included efficiency, 
effectiveness and learnability measurements of using speech commands as well as typing using the interaction 
technique of eye gaze and speech. Speech commands were found, in some instances, to be comparable to, or 
even more efficient or effective than, the traditional means of using the keyboard and mouse to complete the 
same task. Where typing was concerned it was found that the keyboard was far more efficient and effective 
than eye gaze and speech and that even with continued use, eye gaze and speech could not compete with the 
keyboard. 
Even with this objective evidence it is important also to measure the subjective response of participants with 
regard to the use of the application. Subjective feelings were captured through the use of an extensive 
questionnaire (Appendix H) which was administered during the first and last sessions. Informal interviews 
were also conducted with the participants after each session. The facilitator also unobtrusively observed the 
participants during each session and made notes on any behaviour that was deemed noteworthy. This chapter 
reports on the results of these questionnaires, observations and interviews. 
 
8.2 Procedure 
Questionnaires were administered to participants of the study in order to gauge their subjective satisfaction 
with the application. Therefore, the sample was the same as for the previous two chapters. There was, 
however, a single participant who did not attend her last session and therefore did not complete the 
questionnaire in the final session. Therefore, the analysis in this chapter is based on the responses of the 
remaining 24 participants. 
At the end of the very first session which contained the introduction to the application and the informal 
interaction with the application, the participants were required to complete the questionnaire as outlined in 
Appendix G. After their exposure to the application over the course of the ten week longitudinal testing, the 
extended questionnaire as contained in Appendix H was completed by the participants. These questionnaires 
were designed to elicit overall satisfaction with the application, the subjective feelings towards using eye gaze 
and speech to effectively point at objects and in the case of the second questionnaire, satisfaction with the 
typing aspect of the application as well as the speech commands which could be used. The questionnaires 
were compiled from the pointing device assessment questionnaire as advocated by ISO 9241-9 as well as the 
Questionnaire for User Interaction Satisfaction (Chin, Diehl & Norman, 1988). 
The ISO questionnaire consisted of 9 questions, each of which could be ranked on a 5-point scale. For analysis 
purposes, the 5-point scale was divided into three categories. The first category consisted of the two lowest 
rankings, the second category consisted of the neutral category only and the third category consisted of the 
two highest rankings. Although there may be a danger that respondents simply choose the neutral category as 
the least controversial choice, it was decided to still view this as a separate category and not to group it with 
the negative responses as it could not be guaranteed that a negative response was indeed the intention of the 
respondent. 
In order to measure user satisfaction, an extract of the Questionnaire for User Interaction Satisfaction (QUIS) 
was used. These questions were then posed based on the system as a whole, the response to the speech 
commands and the response to the typing. A 5-point qualified Likert scale with explicit adjectives on either 
side of the scale was used as a response scale for all questions. The five-point response scale was numbered 1 
– 5 for analysis purposes and responses were grouped into a negative category which consisted of the two 
lowest points on the scale, neutral which was the midpoint and a positive response which was any response to 
the two highest points of the scale. As advocated by Harper and Norman (1993), each subsection could be 
given an overall score for each participant by calculating the average score for that subsection. 
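The grouping of responses and the subsection score described above can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the function names and the example responses are hypothetical.

```python
from statistics import mean


def categorise(response):
    """Map a response on the 5-point scale to one of the three analysis categories."""
    if response not in (1, 2, 3, 4, 5):
        raise ValueError("response must be an integer from 1 to 5")
    if response <= 2:
        return "negative"  # two lowest points on the scale
    if response == 3:
        return "neutral"   # midpoint of the scale
    return "positive"      # two highest points on the scale


def subsection_score(responses):
    """Overall subsection score for one participant: the mean of the item responses."""
    return mean(responses)


# Hypothetical responses of one participant to a six-question subsection.
example = [4, 3, 5, 2, 4, 4]
categories = [categorise(r) for r in example]
score = subsection_score(example)
print(categories, round(score, 2))
```

Keeping the neutral midpoint as its own category, as in the sketch, mirrors the decision not to group neutral responses with the negative ones.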
 
8.3 Reaction to the application 
 
8.3.1 Satisfaction 
Both the first and second administered questionnaires contained sections to gauge the overall reaction of the 
participants to the application (see Appendix G and H, Part 3). The responses to each of the questions were 
categorised into low usability, neutral usability and high usability, as specified in the previous section. The 
number of responses in each category was determined after which a contingency table was created for each 
question. For example, the contingency table for the first question looked as follows (Yates correction was 
applied during analysis where necessary): 
 Table 8.1: Example contingency table for overall satisfaction
                  Terrible               Neutral           Wonderful
                  (Responses = 1 or 2)   (Responses = 3)   (Responses = 4 or 5)
 First session    1                      8                 16
 Last session     0                      7                 17
 
Thereafter a Chi-square test was used to evaluate the underlying hypothesis for each question.  
H0: There is no difference between satisfaction after the first exposure to the application and after 
extended use of the application. 
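As an illustration, the Chi-square statistic for the example contingency table in Table 8.1 can be computed as sketched below. The sketch assumes that the Yates correction is applied by subtracting 0.5 from the absolute difference between the observed and expected count in every cell; the text does not state which variant of the correction was used, and the function name is hypothetical.

```python
def chi_square(table, yates=False):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if yates:
                diff -= 0.5  # assumed form of the continuity correction
            statistic += diff ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Counts from Table 8.1 (rows: first session, last session).
table_8_1 = [[1, 8, 16], [0, 7, 17]]
stat, df = chi_square(table_8_1, yates=True)
print(round(stat, 2), df)
```

Under this assumption the statistic rounds to 0.02 with 2 degrees of freedom, which agrees with the Yates-corrected value reported for the first question.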
Table 8.2 contains descriptive statistics for each question as well as the result of the Chi-square test for that 
question. Scores are on a scale of 1 to 5, with a midpoint of 3. The first column contains the scale adjectives 
that were used for that particular question. 
The null hypothesis could not be rejected for any of the questions at an α-level of 0.05 which means that user 
reaction to the application was not significantly altered over the course of the ten weeks. The null hypothesis 
for the second question (ranging from frustrating to satisfying) could however be rejected at an α-level of 0.10. 
Since the overall mean decreased from the first to the last session it would indicate that the level of frustration 
was slightly higher at the end of the ten week period. In fact, the mean for all the questions was lower after 
the last session than after the first session but not significantly so. For example, after ten weeks the 
participants found the application less stimulating than after the first exposure. This could be due to the fact 
that they had learnt to use the system and found it less stimulating once they had mastered the use of the 
multimodal interface. However, the adequacy and rigidity measurements were also slightly lower which could 
indicate that once they had explored the available options they felt they needed more freedom than was 
offered by the available grammar and onscreen typing. The fact that the application is highly customisable and 
extendable could offer a solution to this and perhaps should have been pointed out to the participants even if 
they did not get the opportunity to make use of these features. 
Table 8.2: Descriptive statistics for each satisfaction question for the application
                            First session                     Last session                      Chi-square test
                            Mode  Mean  Standard deviation    Mode  Mean  Standard deviation
Terrible – Wonderful        4     3.8   0.9                   4     3.8   0.6                   χ²(2) = 0.02 *, p > 0.05
Frustrating – Satisfying    4     3.6   1.0                   3     3.0   1.0                   χ²(2) = 4.9, 0.05 < p < 0.10
Dull – Stimulating          4     4.1   0.8                   4     3.9   1.0                   χ²(2) = 2.4, p > 0.05
Difficult – Easy            3     3.4   1.2                   3     3.2   0.9                   χ²(2) = 0.4, p > 0.05
Inadequate – Adequate       3     3.6   0.9                   3     3.4   0.8                   χ²(2) = 0.6, p > 0.05
Rigid – Flexible            4     3.6   0.9                   4     3.4   0.8                   χ²(2) = 2.3, p > 0.05
* Yates corrected Chi-square applied 
The fact that the mean for each of the questions rated on the high side for both sessions is encouraging in 
terms of the subjective satisfaction experienced by the participants. 
Each subsection of the QUIS can be given a score based on the responses of the participants (Shneiderman, 
1998). Following the example of other studies (cf., Tullis & Stetson, 2004), a score for this subsection was given 
to each participant. This was calculated as the mean of the responses for the six questions for each participant 
and for each administration of the questionnaire. Table 8.3 below summarises the descriptive statistics for 
both sessions, which are scored on a scale of 1 – 5: 
 Table 8.3: Descriptive statistics for overall satisfaction with application 
 First session Last session 
Mean 4.7 4.5 
Standard Deviation 0.7 0.7 
 
The average satisfaction rating was lower after the last session than after the first session, although a 
satisfactorily high usability measurement was achieved for both sessions. The following hypothesis was formulated 
to determine if the detected difference was significant: 
H0: There is no difference between the overall satisfaction of the participants after the first exposure 
to the application and after extended use of the application. 
A paired t-test was used to determine if the opinions of the participants had changed after they could interact 
with the application over an extended period. The null hypothesis of no difference could not be rejected (t = 
1.74, df = 23), therefore there was no significant difference in the overall satisfaction of the participants between the first 
and the last session. 
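The paired t statistic can be sketched as follows. The individual participant scores are not reproduced in the text, so the scores used here are purely hypothetical and serve only to show the calculation; the function name is also illustrative. The critical value of 2.069 is the standard two-tailed value for df = 23 at an α-level of 0.05.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, last):
    """Paired t statistic and degrees of freedom for two matched lists of scores."""
    diffs = [a - b for a, b in zip(first, last)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical subsection scores for four participants (not data from the study).
first_session = [4.0, 3.5, 5.0, 4.5]
last_session = [3.5, 3.5, 4.5, 4.0]
t_example, df_example = paired_t(first_session, last_session)

# Decision for the reported result (t = 1.74, df = 23) at an alpha-level of 0.05.
T_CRITICAL_DF23 = 2.069  # two-tailed critical value from standard t tables
reported_t = 1.74
reject_null = abs(reported_t) > T_CRITICAL_DF23
print(t_example, df_example, reject_null)
```

Because the reported statistic is smaller than the critical value, the sketch arrives at the same decision as above: the null hypothesis is not rejected.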
Although there was no difference between the sessions, it is encouraging to note that the overall reaction to 
the system was either neutral or positive. No question was rated on the negative side which indicates that, in 
general, the participants viewed their experience with the application as pleasing and satisfying. 
 
8.3.2 Learnability 
The learnability subsection of the questionnaire had four questions (Appendix H, Part 6). Responses after the 
first session were based on the perceived learnability after a short interaction with the system, while the 
responses after the last session were based on experience with the system and how the participants 
experienced their own learning curve. The same categorisation as in the prior section was used to categorise 
the responses for learnability into low, neutral and high learnability (see Table 8.4 for an example).  
 Table 8.4: Example contingency table for overall learnability 
 Difficult Neutral Easy 
First session 8 4 13 
Last session 6 5 13 
 
The following hypothesis was formulated for this purpose: 
H0: There is no difference between subjective feelings of learnability after the first exposure to the 
application and after extended use of the application. 
The afore-mentioned hypothesis was analysed for each question in the learnability subsection and then also 
for the overall learnability of the application. Descriptive statistics and results of the Chi-square test for each of 
the questions are tabulated below. The first column contains the scale that was used as well as the particular 
facet of learnability that was targeted by the question. 
Table 8.5: Descriptive statistics for learnability questions for the application
                                                First session                     Last session                      Chi-square test
                                                Mode  Mean  Standard deviation    Mode  Mean  Standard deviation
Difficult – Easy (Overall learning)             4     3.3   1.2                   4     3.5   1.3                   χ²(2) = 0.4, p > 0.05
Difficult – Easy (Getting started)              3     3.3   1.2                   4     3.1   1.4                   χ²(2) = 8.7, p < 0.05
Difficult – Easy (Learning advanced features)   2     3.0   1.1                   4     3.3   1.1                   χ²(2) = 1.9, p > 0.05
Slow – Fast (Time to learn)                     3     3.3   1.2                   4     3.5   1.4                   χ²(2) = 3.0, p > 0.05
 
The null hypothesis could only be rejected for the question aimed at eliciting an opinion about the ease with 
which a user can make use of the application from the very start. After the first session the majority of the 
participants felt neutral about this aspect of the application but after extended use the majority felt that it was 
relatively easy. However, closer inspection of the spread of the responses shows that while some respondents 
moved from neutral to easy, there were also 5 more that felt the initial learning curve was steeper in 
retrospect than after their first exposure. 
Each participant was given an overall score, calculated as an average of their four responses, as a measure of 
the learnability of the application after both the first and the last sessions. The table below is a summary of the 
descriptive statistics for the learnability of the application. Scores are ranked on a scale of 1 – 5. 
 Table 8.6: Descriptive statistics for overall learnability of the application 
 First session Last session 
Mean 4.1 4.3 
Standard Deviation 1.0 1.1 
 
A paired t-test was conducted to determine if there were any differences between the two sessions. At an α-
level of 0.05, the afore-mentioned null hypothesis could not be rejected (t = 1.235, df = 23). Learnability of the 
application, both in the short- and long-term, was rated positively by the majority of the participants, 
particularly after the last session. 
 
8.4 Typing 
 
8.4.1 Satisfaction 
The questionnaire administered after the final session contained questions aimed specifically at testing user 
satisfaction with regard to typing using the onscreen keyboard and eye gaze and speech. The same six Likert 
(Olivier, 2004) scales with explicit adjectives were provided as with the questions for the whole application.  An 
extra question was added to measure the naturalness of typing using eye gaze and speech with an onscreen 
keyboard.  
In order to determine whether the observed number of participants for each one of the categories is 
significantly different from an even distribution, a contingency table as in Table 8.7 was set up for each one of 
the six questions. 
 Table 8.7: Example contingency table for Chi-square test 
 Terrible Neutral Wonderful 
Observed responses 4 7 13 
Expected even distribution 8 8 8 
 
The following null hypothesis was formulated: 
H0: The responses in the possible categories are evenly distributed. 
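A minimal sketch of this comparison is given below. It assumes, as the layout of Table 8.7 suggests, that the observed row and the expected even distribution are analysed together as a 2 × 3 contingency table; the function name is illustrative.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Observed counts from Table 8.7 (Terrible, Neutral, Wonderful).
observed = [4, 7, 13]
# Even distribution of the same number of responses over the three categories.
expected_even = [sum(observed) / len(observed)] * len(observed)

stat, df = chi_square([observed, expected_even])
print(round(stat, 1), df)
```

Under this assumption the statistic for the example rounds to 2.6 with 2 degrees of freedom, which agrees with the value tabulated for the terrible – wonderful question in Table 8.8.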
Descriptive statistics for the responses, together with the results of the respective Chi-square tests, are 
summarised in Table 8.8 (scores are reported on a scale of 1 – 5). 
Chart 8.1 below shows a stacked bar chart for the number of responses in each category. The numbers in the 
stacks indicate the number of responses in that category.  
The null hypothesis could not be rejected for any of the questions at an α-level of 0.05, but that of the dull – 
stimulating measurement could be rejected at an α-level of 0.1. This means that significantly more 
respondents experienced the application as being stimulating than those who experienced it as dull.  The 
mean score for the majority of these questions is above the midpoint of the scale which suggests that the 
experience of typing was a fairly pleasant one for most of the participants. The subjective feelings of difficult 
and frustrating had a number of responses on the low side of the scale. This is understandable as it takes some 
practice for the participants to become used to typing using eye gaze and speech and even after some practice 
it could still be considered challenging to maintain a stable gaze on a button long enough to issue the 
command required to type the letter to the document. It could also be frustrating for the users in the sense 
that sometimes they would still glance away whilst issuing the command, thereby preventing the desired letter 
from being typed to the document. This could also account for the fair number of responses in the neutral category 
of the rigid – flexible scale. Unfortunately, the feeling of naturalness experienced during interaction is below 
the midpoint which indicates that the majority of participants found this method of interaction unnatural to 
some degree. The novelty of the interaction technique could play a significant role in this mindset and it may 
be necessary to increase the level of exposure and practice with the application before it could possibly match 
the naturalness of traditional keyboard typing. 
 
 Table 8.8: Descriptive statistics for satisfaction questions for the typing feature
                             Mode  Mean  Standard deviation   Chi-square test
 Terrible – Wonderful        4     3.5   1.0                  χ²(2) = 2.6, p > 0.05
 Frustrating – Satisfying    3     3.1   1.1                  χ²(2) = 0.1, p > 0.05
 Dull – Stimulating          3     3.7   1.1                  χ²(2) = 4.8, 0.05 < p < 0.10
 Difficult – Easy            3     3.2   1.2                  χ²(2) = 0.2, p > 0.05
 Inadequate – Adequate       3     3.6   0.9                  χ²(2) = 3.1, p > 0.05
 Rigid – Flexible            3     3.3   1.2                  χ²(2) = 0.8, p > 0.05
 Unnatural – Natural         2     2.8   1.3                  χ²(2) = 0.8, p > 0.05
 
 
[Stacked bar chart; the numbers of responses shown in the stacks are listed below.]
                            Negative  Neutral  Positive
Terrible - Wonderful        4         7        13
Frustrating - Satisfying    7         9        8
Dull - Stimulating          2         8        13
Difficult - Easy            7         7        9
Inadequate - Adequate       3         8        12
Rigid - Flexible            5         9        9
Unnatural - Natural         11        6        7
 
Chart 8.1: Number of responses in each category of the typing feature satisfaction questions 
 
 
8.4.2 Learnability 
The learnability of typing using eye gaze and speech was measured using the same four questions as for the 
overall usability of the application. Once again, the same grouping per category was used for the responses. 
Since there was only one measurement available, the average number of participants was used as a secondary 
measurement (see Table 8.7 in previous section for an example) in order to analyse the underlying hypothesis. 
H0: The responses in the possible categories are evenly distributed. 
Descriptive statistics and the results of the Chi-square test for the learnability of typing are summarised in 
Table 8.9. Scores are measured on a scale of 1 – 5 and the first column contains the adjectives used on the 
scale for the particular question. 
 Table 8.9: Descriptive statistics for learnability questions for the typing feature
                                                 Mode  Mean  Standard deviation   Chi-square test
 Difficult – Easy (Overall learning)             3     3.2   1.4                  χ²(2) = 0.8, p > 0.05
 Difficult – Easy (Getting started)              3     3.0   1.3                  χ²(2) = 0.4, p > 0.05
 Difficult – Easy (Learning advanced features)   3     3.3   1.2                  χ²(2) = 1.0, p > 0.05
 Slow – Fast (Time to learn)                     5     3.8   1.3                  χ²(2) = 5.7, 0.05 < p < 0.10
 
The null hypothesis could not be rejected for any of the questions which indicates that the opinion of the 
participants is not significantly skewed to any of the categories. However, interpretation of the results using an 
α-level of 0.10 allows the null hypothesis to be rejected for the time to learn to use the application when 
measured on a scale of slow to fast. 
The chart below is a stacked bar graph for the number of responses in each category. The numbers in each 
stack indicate the number of responses in that particular category. 
 
[Stacked bar chart; the numbers of responses shown in the stacks are listed below.]
                                        Difficult  Neutral  Easy
Difficult - Easy (Learning)             6          7        11
Difficult - Easy (Getting started)      9          6        9
Difficult - Easy (Learning advanced)    5          9        10
Slow - Fast (Time to learn)             4          3        15
 
 Chart 8.2: Number of responses in each category of the typing feature learnability questions 
 
From the graph it can be seen that the vast majority of the participants felt that learning to type with the 
onscreen keyboard was fast. Getting started with the typing was equally considered to be easy and difficult 
since both categories had 9 responses. This could possibly be attributed to the fact that some users have a 
predisposition to embrace new advancements and find it easy and quick to get into the habit of using the new 
features while others may balk at the idea of having to change the way they interact with an application. In 
general, it seems as though participants found that learning to type with eye gaze and speech was relatively 
easy. 
 
8.4.3 Preference and ease of use for typing settings
During longitudinal testing, participants used three different configurations for the onscreen keyboard, namely 
(1) large buttons, (2) smaller buttons which were widely spaced and (3) smaller buttons which were spaced 
closer together. Participants were asked to rank both their preference (Appendix H, question 21) of these 
three configurations as well as the order in which they found them easiest to use (Appendix H, question 22). 
Since subjective preference does not always mirror objective performance, it was decided to pose both these 
questions to determine if the preference of the participants differed from the ease with which they could use 
the onscreen keyboards. 
Charts 8.3 and 8.4 below respectively show stacked bar graphs for the preference ranking and the ease of use 
ranking. The numbers indicate the number of responses in each category. The blue bar indicates the number 
of respondents who ranked the specific keyboard setup as their first preference, the red is their second 
preference and green the third most preferred setup. 
 
                             First   Second   Third
Smaller, closer together       5       9       10
Smaller, widely spaced        15       7        2
Large buttons                  4       9       11

Chart 8.3: Preference ranking of the onscreen keyboard setups 
 
                             First   Second   Third
Smaller, closer together       3      13        8
Smaller, widely spaced        15       6        3
Large buttons                  6       6       12

Chart 8.4: Ease of use ranking for the onscreen keyboard settings 
 
The charts clearly indicate that preference is highest for the smaller widely spaced buttons (15 of 24 
first-place rankings) and that the majority of the participants also found these the easiest to use (15 of 24). 
The larger buttons were ranked last most often for both preference (11 participants) and ease of use (12 
participants). The following is the contingency table used for the participant preference: 
 Table 8.10: Contingency table for keyboard setup preference 
                           First preference   Second preference   Third preference
Small, closer together            5                   9                  10
Small, widely spaced             15                   7                   2
Large buttons                     4                   9                  11
 
The following hypotheses were formulated for this analysis: 
1. H0,1: Participants’ preference is independent from the keyboard setup. 
2. H0,2: The perceived ease of use is independent from the keyboard setup. 
 
H0,1 could be rejected at an α-level of 0.05 (χ²(4) = 15.9, p < 0.05) which suggests that there is a significant 
preference for a certain keyboard setup. The smaller widely spaced buttons were chosen by the majority of 
the participants as their most preferred setup. Therefore, it could be concluded that there was a significant 
preference for this setup although the results of Chapter 7 indicated that there was no significant difference 
between the effectiveness and efficiency of these setups. 
The second null hypothesis could also be rejected at an α-level of 0.05 (χ²(4) = 19.0, p < 0.05), which indicates 
that a certain setup is significantly easier to use than others. Once again, the smaller widely spaced buttons 
were chosen as the easiest to use therefore it could be said that they were perceived as being significantly 
easier to use than the other setups. Similar to the preference of the setups, although there was a significant 
difference in the subjective usability of these setups, results from the previous chapter indicate that there was 
no objective difference between these three setups. 
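
The two test statistics reported above can be checked with a short calculation. The following is a minimal sketch of a Pearson Chi-square test of independence applied to the counts in Table 8.10 and Chart 8.4. The function name and variable names are illustrative only, and the critical value of 9.488 is the standard tabulated value for α = 0.05 with 4 degrees of freedom.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of setup and ranking.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: small closer together, small widely spaced, large buttons.
# Columns: ranked first, second, third.
preference = [[5, 9, 10], [15, 7, 2], [4, 9, 11]]    # Table 8.10
ease_of_use = [[3, 13, 8], [15, 6, 3], [6, 6, 12]]   # Chart 8.4

# Tabulated Chi-square critical value for alpha = 0.05 and 4 degrees of freedom.
CRITICAL_VALUE = 9.488

for name, table in (("preference", preference), ("ease of use", ease_of_use)):
    statistic, dof = chi_square_independence(table)
    reject = statistic > CRITICAL_VALUE
    print(f"{name}: chi2({dof}) = {statistic:.1f}, reject H0: {reject}")
```

Under these assumptions both statistics exceed the critical value, in agreement with the rejection of both null hypotheses.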
8.5 Commands 
 
8.5.1 Satisfaction 
Using the same procedure as with the previous sections, the subjective satisfaction of the participants towards 
using the speech commands was gauged and divided into categories. The same questions as in the previous 
sections were used and then categorised into the negative, neutral and positive categories. The following 
hypothesis was formulated: 
H0: The responses in the possible categories are evenly distributed. 
A contingency table for the first question is given as an example below. 
 
 Table 8.11: Example of contingency table for satisfaction with speech commands 
                        Terrible                Average             High
                        (Responses = 1 or 2)    (Responses = 3)     (Responses = 4 or 5)
Observed responses           2                       8                   14
Expected responses           8                       8                    8
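
The text does not spell out how the test statistic is derived from a table such as Table 8.11. One reading that appears to reproduce the value reported for the terrible – wonderful scale (χ²(2) = 5.2) is to treat the observed and expected rows as a 2 × 3 contingency table. The sketch below follows that reading; it is an assumption about the computation rather than a documented procedure, and the function name is illustrative.

```python
def two_row_chi_square(observed, expected):
    """Chi-square for a two-row table built from observed and expected counts.

    Assumption: both rows are treated as ordinary rows of a contingency
    table, so the cell expectations come from the row and column totals.
    """
    rows = [observed, expected]
    row_totals = [sum(row) for row in rows]
    col_totals = [o + e for o, e in zip(observed, expected)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(rows):
        for j, count in enumerate(row):
            cell_expectation = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - cell_expectation) ** 2 / cell_expectation
    dof = len(observed) - 1  # (2 - 1) * (columns - 1)
    return statistic, dof


observed = [2, 8, 14]  # Table 8.11, observed responses
expected = [8, 8, 8]   # Table 8.11, expected responses

statistic, dof = two_row_chi_square(observed, expected)
print(f"chi2({dof}) = {statistic:.1f}")
```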
 
Descriptive statistics for these questions are summarised in Table 8.12. The first column contains the 
adjectives used on the scale for the questions and the rightmost column summarises the results of the Chi-
square test performed on each question. 
 
 Table 8.12: Descriptive statistics for satisfaction questions for the command feature 
                            Mode   Mean   Standard deviation   Chi-square test
Terrible – Wonderful          4    3.7          0.9            χ²(2) = 5.2, 0.05 < p < 0.10
Frustrating – Satisfying      3    3.4          1.1            χ²(2) = 1.0, p > 0.05
Dull – Stimulating            4    4.0          0.8            χ²(2) = 10.0, p < 0.05
Difficult – Easy              3    3.7          1.0            χ²(2) = 3.1, p > 0.05
Inadequate – Adequate         4    3.5          1.0            χ²(2) = 2.6, p > 0.05
Rigid – Flexible              4    3.6          1.1            χ²(2) = 2.2, p > 0.05
Unnatural – Natural           3    3.6          0.9            χ²(2) = 4.6, 0.05 < p < 0.10
 
The null hypothesis could be rejected at an α-level of 0.05 only for the dull – stimulating scale, where a 
significant number of participants found the speech commands stimulating. For the remaining questions there 
was no significant difference in the level of satisfaction at this α-level, although the null hypothesis for the 
scales of terrible – wonderful and unnatural – natural could be rejected at an α-level of 0.1. Most participants 
felt that the use of commands was wonderful (14 of 24) and half felt that it was natural (12 of 24). The 
responses to the questions were mapped to a scale of 1 to 5 and a mean score was calculated for each question 
using these mappings. The mean scores for the questions were all above the scale midpoint of 3 for the 
speech commands. This tendency was confirmed through inspection of Chart 8.5 which shows the number of 
responses in each category.  
                              Negative   Neutral   Positive
Terrible - Wonderful              2         8         14
Frustrating - Satisfying          5        10          9
Dull - Stimulating                1         5         18
Difficult - Easy                  3         9         12
Inadequate - Adequate             4         7         13
Rigid - Flexible                  5         6         13
Unnatural - Natural               2        10         12

Chart 8.5: Number of responses in each category for satisfaction questions for command feature 
 
From the graph it can clearly be seen that using speech commands was judged satisfactory by a larger number 
of participants for all the rating scales than the typing. Similar to the typing feature, the scale of frustrating – 
satisfying had the lowest number of responses for high satisfaction. This could easily be attributed to the fact 
that ambient noise often interfered with the speech recognition and sometimes led to unexpected responses 
from the application. This could at times be frustrating for the participants. Even so, the level of satisfaction 
experienced was more than acceptable. 
In comparison with the satisfaction experienced with typing, the commands appear to be more gratifying for 
the participants. They were ranked as more stimulating, adequate, flexible and natural by more participants 
than the typing was. The objective results reported in Chapter 6 also indicated that in some instances the use 
of speech commands was comparable to user performance when using the keyboard, while in others the 
speech commands performed even better than the keyboard. Therefore, in this case user satisfaction closely 
mirrors the objective usability measurements. 
 
8.5.2 Learnability 
The same four questions used to gauge learnability in the previous two sections were posed to the participants 
in respect of the learnability of the commands. Similar to previous sections, the responses were grouped into 
the three categories of difficult, neutral and easy (see Table 8.7 for an example). The following hypothesis was 
formulated: 
H0: The responses in the possible categories are evenly distributed. 
Descriptive statistics for the questions relating to learnability are summarised below. Scores are measured on a 
scale of 1 to 5. The first column indicates the question together with the scale used, while the final column 
reports the results of the Chi-square test. 
The null hypothesis could not be rejected for any of the questions, which indicates that no category received a 
significantly larger number of responses than the others. However, the mean response for all of the questions 
lies on the positive side of the scale, which offers some encouragement as to the learnability of the commands. 
 Table 8.13: Descriptive statistics for learnability questions for the command feature 
                                                Mode   Mean   Standard deviation   Chi-square test
Difficult – Easy (Overall learning)               4    3.6          1.1            χ²(2) = 3.3, p > 0.05
Difficult – Easy (Getting started)                3    3.5          1.2            χ²(2) = 1.6, p > 0.05
Difficult – Easy (Learning advanced features)     4    3.6          1.1            χ²(2) = 2.6, p > 0.05
Slow – Fast (Time to learn)                       3    3.6          1.0            χ²(2) = 4.5, p > 0.05
 
8.5.3 Types of commands 
The questionnaire also required the respondents to distinguish between the different types of actions and 
functions that could be achieved through the use of speech commands and to rank them on a 5-point scale 
ranging from difficult to easy. The four types of functions which could be completed using speech commands 
were navigation (moving the cursor), formatting text (for example, bold and italic), selecting text and actions 
such as copying, cutting and pasting. The following table is the contingency table which was used to analyse the 
satisfaction with moving the cursor: 
 Table 8.14: Contingency table for satisfaction with moving the cursor 
                       Low   Neutral   High
Observed responses      9       2       12
Expected mean           8       8        8
 
The table below contains descriptive statistics for these questions as well as the Chi-square test results for 
each question. The following hypothesis was tested for each question: 
H0: The responses in the possible categories are evenly distributed. 
 Table 8.15: Descriptive statistics for satisfaction of command types 
                                Mode   Mean   Standard deviation   Chi-square test
Moving the cursor                 2    3.3          1.4            χ²(2) = 4.4, p > 0.05
Formatting text                   5    4.3          0.7            χ²(2) = 11.2 *, p < 0.05
Selecting text                    5    4.3          1.0            χ²(2) = 12.3, p < 0.05
Cutting/copying and pasting       5    4.6          0.7            χ²(2) = 16.4 *, p < 0.05
 * Yates corrected Chi-square applied 
From the table it can be seen that the null hypothesis could be rejected for all functions except for 
navigation. Nine participants felt that navigation was difficult while twelve felt that it was easy. With the two 
neutral responses this gives a fairly even spread of opinions of the difficulty level of navigating using speech 
commands. In contrast, 20 participants felt that formatting text was easy, 20 indicated that selecting text 
was easy and 23 found cutting/copying and pasting easy. Therefore, it could be concluded that functions 
achievable through the issuing of speech commands were considered easy by the majority of the participants 
except for having to navigate through the document. Since it was discovered during the analysis of the number 
of actions used that most participants moved the cursor one character at a time and failed to use shorter 
methods of moving the cursor, it is entirely plausible that participants found navigation a laborious process 
when having to use speech commands. Even so, when inspecting Chart 8.6 below, it is clear that when 
grouping the responses in the difficult, average and easy categories, the largest group of respondents (12 of 
23) still found this function easy to master. 
 
                                Difficult   Average   Easy
Moving the cursor                   9          2       12
Formatting text                     0          4       20
Selecting text                      2          2       20
Cutting/copying and pasting         1          0       23

Chart 8.6: Number of responses in each satisfaction category for command types 
 
8.6 Additional considerations 
The questionnaire also contained a number of yes/no type questions to gauge user response and willingness to 
continue using the features of the multimodal interface (Appendix H, questions 15 – 20). Chart 8.7 shows a 
stacked graph of the response rate for each of these questions, which for these purposes have been rewritten 
in a shorter format. 
The graph clearly shows that the majority (14) of the respondents would prefer to have both visual and audio 
feedback when a keyboard button is pressed. While the audio feedback serves the purpose of alerting the user 
to the fact that a character has successfully been inserted into the document (21 participants responded 
positively), participants seem to prefer that a visual cue be given in order to confirm which letter was inserted 
into the document. During development of the application it was considered sufficient that the button which 
had focus be framed and that this would serve as an indicator as to which character had been inserted. Clearly, 
while this assists the participants in the knowledge of which button has focus, they desire a different 
mechanism as confirmation of activation. Therefore, in this instance, user preference closely correlates to the 
option which was found to have the highest text entry speeds (Majaranta, 2009). 
                                                                      Yes   No
Would you prefer visual feedback together with the audio feedback?     14   10
Did the audio feedback assist with the typing process?                 21    3
Would you use it to issue commands?                                    19    5
Would you use this for typing purposes?                                15    9
Did you feel more at ease after becoming accustomed to the
application?                                                           24    0
Did you feel as though your typing improved?                           21    3

Chart 8.7: Number of responses in each category for additional considerations of using eye gaze and speech 
 
The responses to indicate continued use of the features are very encouraging because in all instances the 
participants showed an overwhelming eagerness to continue using the features made available for them. Most 
participants would consider using the speech commands as well as the onscreen typing. Furthermore they felt 
more at ease with the system the more they used it which is an important factor to bear in mind in terms of 
product acceptance. Results in Chapter 7 indicate that there was no significant improvement in the typing 
speeds as the weeks progressed, but the fact that participants felt they had improved could improve the 
acceptance of and satisfaction with the multimodal interface. 
The table below contains the results of the Chi-square test used to analyse the following hypothesis: 
H0: The responses are evenly distributed over the possible categories. 
 Table 8.16: Analysis results for satisfaction of additional considerations 
                                                                          Chi-square test
Did you feel as though your typing improved?                              χ²(2) = 7.9, p < 0.05
Did you feel more at ease after becoming accustomed to the application?   χ²(2) = 13.4 *, p < 0.05
Would you use this for typing purposes?                                   χ²(2) = 0.8, p > 0.05
Would you use it to issue commands?                                       χ²(2) = 4.5, p < 0.05
Did the audio feedback assist with the typing process?                    χ²(2) = 7.9, p < 0.05
Would you prefer visual feedback together with the audio feedback?        χ²(2) = 0.3, p > 0.05
 * Yates corrected Chi-square applied 
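
The statistics in Table 8.16 appear to be consistent with comparing the observed yes/no counts of Chart 8.7 against an even 12/12 split in a 2 × 2 table, with Yates' continuity correction applied to the starred entry. The sketch below follows that reading, which is an assumption and not a documented procedure; the function name, dictionary keys and variable names are illustrative.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]].

    With yates=True, Yates' continuity correction is applied by reducing
    |ad - bc| by half the grand total before squaring.
    """
    n = a + b + c + d
    difference = abs(a * d - b * c)
    if yates:
        difference = max(difference - n / 2, 0)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * difference ** 2 / denominator


# (yes, no, Yates correction applied) taken from Chart 8.7 and Table 8.16.
questions = {
    "typing improved": (21, 3, False),
    "more at ease": (24, 0, True),
    "use for typing": (15, 9, False),
    "use for commands": (19, 5, False),
    "audio feedback assisted": (21, 3, False),
    "prefer visual feedback": (14, 10, False),
}

# Tabulated critical value for alpha = 0.05 with 1 degree of freedom (2 x 2 table).
CRITICAL_VALUE = 3.841

results = {}
for name, (yes, no, yates) in questions.items():
    even_split = (yes + no) / 2  # expected row under H0: 12 yes, 12 no
    results[name] = chi_square_2x2(yes, no, even_split, even_split, yates)
    significant = results[name] > CRITICAL_VALUE
    print(f"{name}: chi2 = {results[name]:.1f}, significant: {significant}")
```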
The null hypothesis could be rejected for all questions except those pertaining to the inclusion of visual 
feedback and the use of the application for typing purposes. These results confirm that a significantly larger 
proportion of the respondents would consider continued use of the application for commands, while the 
majority in favour of continued use for typing (15 of 24) was not statistically significant. Additionally, the 
respondents felt as though their typing 
skills with the onscreen keyboard had improved and that they had become more at ease with the application 
as they got more exposure.  The findings also confirm the importance of feedback as a significant number of 
participants felt that audio feedback enhanced the typing process. 
 
8.7 Pointing device 
The device assessment questionnaire recommended by ISO 9241-9 for assessing the usability of pointing devices 
was given to the participants after the first session and the last session (Appendix H). They were instructed to 
complete that part of the questionnaire with reference to using eye gaze and speech for text input. The 5-
point scale used was divided into three categories for each question using the two lowest scale points, the 
midpoint and the two highest. The table below serves as an example and is a contingency table for the first 
question pertaining to actuation. 
 Table 8.17: Example of a contingency table for device assessment questions 
                  Low   Neutral   High
First session      6      11       8
Last session       3      16       5
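
The grouping of the 5-point scale into three categories, as described above, can be sketched as follows. The response lists are hypothetical and serve only to illustrate how a contingency table such as Table 8.17 could be assembled; they are not data from the study, and the function names are illustrative.

```python
from collections import Counter
from statistics import mean, stdev

CATEGORIES = ("Low", "Neutral", "High")


def collapse(response):
    """Map a 5-point response onto the Low / Neutral / High categories."""
    if not 1 <= response <= 5:
        raise ValueError("responses must lie on the 5-point scale")
    if response <= 2:
        return "Low"      # two lowest scale points
    if response == 3:
        return "Neutral"  # midpoint
    return "High"         # two highest scale points


def contingency_row(responses):
    """Count the responses that fall into each of the three categories."""
    counts = Counter(collapse(r) for r in responses)
    return [counts[category] for category in CATEGORIES]


# Hypothetical responses to one question after the first and last sessions.
first_session = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
last_session = [2, 3, 3, 3, 3, 4, 4, 4, 5, 3]

table = [contingency_row(first_session), contingency_row(last_session)]
print(table)
print(f"first session: mean = {mean(first_session):.1f}, "
      f"std dev = {stdev(first_session):.1f}")
```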
 
A Chi-square test was used to determine whether the opinion of the participants changed between the first 
and the last session. Descriptive statistics and results of the Chi-square test are summarised in Table 8.18. The 
first column indicates the characteristic of the device which was being evaluated as well as the scale on which 
it was evaluated. The following null hypothesis was formulated: 
H0: There is no difference in the assessment of the device after the first session and after the last 
session. 
The null hypothesis could not be rejected for the majority of the assessment questions which suggests that the 
opinion of the participants to the use of eye gaze and speech was not swayed during extended use thereof. 
The only question for which there was a significant difference in the responses was concerned with the 
smoothness of the operation. In this instance, participants were more positive about the smoothness of the 
operation after prolonged use of the application. Considering that the overall response to the eye gaze and 
speech was positive in terms of the effort required, the speeds achieved and the fatigue or discomfort it may 
cause, this is a very encouraging result. 
 
8.8 Anecdotal observations 
During each session, participants were closely observed by the facilitator as they completed the tasks. A 
number of general observations were made during the ten week period. The fact that the application was 
susceptible to ambient noise appeared to cause the greatest frustration for the participants. In the first few 
weeks they did resort to “hyperarticulation” (Oviatt et al., 1998) when the application did not respond as they 
wanted it to. Even so, after some weeks the participants learnt to compensate for these problems and did not 
react negatively as they had in the first few weeks. This observation confirms the suspicion that users will be 
flexible and resilient enough to overcome some of the limitations of the technologies (Nusbaum et al., 1995). 
Furthermore, the fact that some ambient noise was present provided a more natural and “real-world” 
environment which allowed the application to be tested in an environment resembling one in which actual use 
may occur, such as an office setting. 
Table 8.18: Descriptive statistics for device assessment questionnaire responses 
                                                           First session          Last session
                                                        Mode  Mean  Std dev    Mode  Mean  Std dev   Chi-square test
Actuation (too low – too high)                            3    3.2    1.1        3    3.0    0.7     χ²(2) = 2.6, p > 0.05
Smoothness (very rough – very smooth)                     3    3.0    1.1        3    3.3    0.7     χ²(2) = 7.1, p < 0.05
Mental effort (too low – too high)                        2    3.2    1.0        3    3.0    1.0     χ²(2) = 0.6, p > 0.05
Physical effort (too low – too high)                      2    2.6    1.2        2    2.5    1.1     χ²(2) = 2.5, p > 0.05
Accurate pointing (easy – difficult)                      3    3.6    1.1        4    3.6    0.9     χ²(2) = 4.3, p > 0.05
Operation speed (too fast – too slow)                     3    2.6    0.9        3    2.7    0.8     χ²(2) = 0.2, p > 0.05
Neck fatigue (none – very high)                           1    2.0    1.1        3    2.9    1.3     χ²(2) = 4.7, 0.05 < p < 0.10
General comfort (very uncomfortable – very comfortable)   4    3.8    1.0        3    3.3    1.0     χ²(2) = 2.5, p > 0.05
Overall use (very difficult – very easy)                  4    3.4    1.3        4    3.5    0.8     χ²(2) = 1.370, p > 0.05
  
Due to the orientation of the onscreen keyboard it was found that accuracy was heightened if a participant 
looked down at the keyboard with the screen tilted slightly downwards. 
Informal conversations with the participants were also held after each session. During these conversations it 
became clear that participants enjoyed using the application the more they were exposed to it. They also 
became more comfortable issuing verbal commands and using the onscreen keyboard to type. The opinions 
captured during the interviews were the same for many of the participants and unfortunately did not serve to 
complement the study to the degree which was initially hoped when the experimental methodology was 
planned. 
One participant did share a fairly humorous experience he had during the testing period. He had become so 
accustomed to using the onscreen keyboard and enjoyed using it so much that when faced with typing an 
assignment he found himself issuing verbal commands to complete both the typing and formatting. Naturally, 
the application did not respond, which led to great disappointment on his part. Quite a large majority of the 
participants indicated that they would use the application if they had access to it. 
Observations of some of the participants over the weeks led to the supposition that they valued speed above 
accuracy when typing using the onscreen keyboard. When typing using the onscreen keyboard it became 
obvious that they were attempting to achieve high typing speeds, often to the detriment of accuracy as they 
scarcely gave the indicator time to stabilise before issuing commands and looking quickly at the next letter. 
During an interview after the ninth session, one participant commented that he found it easier to move along 
the keys on the keyboard one at a time than trying to focus directly on the next desired letter from the 
previously typed one. He felt that this made it much easier to type and to maintain a stable gaze. Although this 
would increase his typing time, he felt the increase in the accuracy was a worthwhile trade-off. This behaviour 
was also observed in some of the other participants which was in direct contrast to some of the observed 
behaviour mentioned before. This indicates that the desired balance of accuracy and speed is a highly 
subjective one. 
 
8.10 Summary 
This chapter discussed the results of the questionnaires which were completed by study participants after the 
first and last session of the longitudinal user testing. The questionnaires were designed to elicit the subjective 
opinions of the participants towards the use of speech for issuing commands in a word processor, using eye 
gaze and speech to type and the overall satisfaction experienced with the application. 
Overall, the satisfaction of the participants was positive with very little shift in opinion from the first session to 
the last. However, the use of eye gaze and speech to type was ranked as very unnatural – a mindset which 
could perhaps be changed after more exposure and practice with the system. The positive response to the 
satisfaction and learnability of the application is heartening as this may increase the acceptance rate of such 
an application by mainstream users of a word processor. 
The next chapter will provide a discussion on the results of the complete study and make recommendations 
for further study and use. 
 
  
 
 
CHAPTER 9 
CONCLUSION 
 
9.1 Introduction 
The previous chapters reported on the results of the statistical analysis of the user testing. This chapter will 
start by providing a summary of the results and how they serve to answer the research questions which were 
originally posed. This will be followed by the limitations of the study and the recommendations based on the 
findings. Finally, the implications the study has for the future will be discussed. 
 
9.2 Motivation  
The motivation of the study was essentially threefold. Firstly, as the word processor is a popular, everyday tool 
of a majority of computer users, it offers an environment well-suited to improvement and exploitation of 
emerging technologies. Secondly, the future of user interfaces indicates that there will be a movement away 
from traditional GUI interfaces. This presented an opportunity to provide and test a multimodal interface for 
the word processor which uses non-traditional interaction techniques. Thirdly, this could offer a customisable 
interface which could potentially cater for a very diverse group of users. In particular, it offered the potential 
of including the oft-marginalised disabled users into the mainstream group of users by providing a means of 
interaction which is not dependent on the keyboard and mouse. 
 
9.3 Aim 
The study had three main aims which were investigated. The first aim was simply to determine whether a 
customisable multimodal interface could be developed for a mainstream word processor. This interface should 
have the potential to offer a variety of interaction means which could be set according to the needs and 
environment of the user. The second aim was to determine whether the interface that was developed was 
feasible within the confines of the mainstream word processor and whether it exhibited long term potential as 
a viable future interface. The final aim was to determine how usable eye gaze and speech are as an interaction 
technique within a word processor. This aim could be subdivided into three secondary aims based on the types 
of interaction required within a word processor. To begin with, it was necessary to determine whether eye 
gaze and speech could be used to replace a pointing device. Thereafter, it had to be determined whether 
common word processing tasks could be accomplished using the proposed interaction techniques. Finally, text 
entry is the most integral part of a word processor and it had to be established whether eye gaze and speech 
could be used for text entry in a usable manner and whether it was comparable to the traditional means of 
text entry. 
 
9.4 Results 
Each of the aims, which led to individual research questions, was explored using specific tests and 
experimental designs. The results of the tests will be briefly summarised in this section in the order that they 
were conducted to answer the research questions. 
9.4.1 Multimodal word processor 
Can a customisable multimodal interface be developed and successfully incorporated into a mainstream 
word processor? 
A multimodal interface using eye gaze and speech was developed and incorporated into Microsoft Word 2007 
(Chapter 3). The multimodal interface provided a number of different interaction techniques which facilitated 
a hands-free environment. The main aim was to provide an interface which was highly customisable since a 
word processor is used by a very diverse group of users and in varying environments.  
As a result, the multimodal interface provided a speech grammar for common word processing tasks such as 
formatting, navigation, text manipulation and text selection. Eye gaze was then incorporated into the interface 
using identified activation mechanisms. Dwell time, look-and-shoot (with the Enter key) and blinking were all 
available to use as interaction techniques. The configuration of the onscreen keyboard was QWERTY by default 
but could also be changed to an alphabetical layout. The setting used for dwell time as well as the sensitivity of 
the pronounced blink could be set by the users to meet their needs. 
The interaction of interest in this case was the combination of eye gaze and speech. Using eye gaze to control 
the focus and indicate which letter was required, the user could issue a speech command to type the letter in 
the current document at the current cursor position. The combination of eye gaze and speech could provide 
improved speed as it is not dependent on a dwell time. It could also exhibit increased accuracy as there is 
almost no potential for inadvertently activating the incorrect button due to prolonged eye gaze or a blink. 
Potentially, in order to increase the accuracy even more, magnification could also be used to increase the area 
directly under the gaze of the user. All activation techniques were still available to be used while the 
magnification was turned on. 
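
The combination of eye gaze for focus and speech for activation can be illustrated with a minimal sketch. The class names, the key geometry and the trigger word "go" are all hypothetical and are not taken from the application; the sketch only shows the principle that gaze selects the key while a spoken command commits it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Key:
    label: str
    left: float
    top: float
    width: float
    height: float

    def contains(self, x, y):
        """Return True if the gaze point (x, y) falls inside this key."""
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


@dataclass
class GazeSpeechKeyboard:
    keys: List[Key]
    trigger_word: str = "go"  # hypothetical spoken command
    focused: Optional[Key] = None
    typed: List[str] = field(default_factory=list)

    def on_gaze(self, x, y):
        """Eye gaze only moves the focus; it never activates a key."""
        self.focused = next((k for k in self.keys if k.contains(x, y)), None)

    def on_speech(self, word):
        """The spoken command types the key that currently has focus."""
        if word == self.trigger_word and self.focused is not None:
            self.typed.append(self.focused.label)


keyboard = GazeSpeechKeyboard([Key("a", 0, 0, 40, 40), Key("b", 40, 0, 40, 40)])
keyboard.on_gaze(10, 10)    # look at "a"
keyboard.on_speech("go")    # types "a"
keyboard.on_gaze(50, 10)    # look at "b"
keyboard.on_speech("go")    # types "b"
keyboard.on_gaze(500, 500)  # look away from the keyboard
keyboard.on_speech("go")    # nothing has focus, so nothing is typed
print("".join(keyboard.typed))
```

Because activation is tied to the spoken command and not to the length of the gaze, no dwell time is needed and a prolonged glance or a blink cannot activate a key in this sketch.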
The resulting interface was one which incorporated speech in terms of dictation and speech commands and 
eye gaze which could be used in a number of ways in isolation or with speech as an activation mechanism. 
The successful development and incorporation within Word 2007 proved that such a highly customisable, 
multimodal interface was indeed possible within a mainstream application. 
 
9.4.2 Feasibility study 
How feasible is such an interface and in which context is it feasible? 
The development of such an interface does not implicitly mean that such an interface is feasible or viable by 
any means. Therefore, before user testing commenced, a feasibility study was undertaken to determine 
whether there was any potential for adoption of such an interface. 
In order to determine this, a number of experienced HCI researchers participated in a short demonstration of 
the application, interacted with the system and then completed a questionnaire designed to determine 
whether they felt that the multimodal interface had long-term use potential. Results indicated that the 
interface was a move in the right direction, particularly as a means of exposing disabled users to a mainstream 
application. Furthermore, they were positive about the potential that the interface offered and felt that in the 
long-term the use thereof may be beneficial to a wide group of users. 
9.4.3 User testing 
How usable is the multimodal interface compared to the traditional interaction techniques? 
The afore-mentioned results could not be statistically verified and served only as a means to determine 
whether the proposed interface was possible and viable. The next step of the study was to determine how 
usable the application was. However, a comparison of all the interaction techniques would require a large 
scale project to be undertaken which was beyond the scope of this thesis. Therefore, the proposed interaction 
technique of eye gaze and speech was tested as a pointing device, to format a document and for text entry. 
Where applicable, this was then compared to other means of input. This allowed three secondary questions to 
be posed in order to answer the above-mentioned research question. 
 
9.4.3.1 Usability of eye gaze and speech as a pointing technique 
How usable is the combination of eye gaze and speech when used to replace a pointing device? 
The multi-directional tapping ISO assessment (ISO, 2000) was used to determine the usability of eye gaze and 
speech to replace a pointing device. Throughput, selection time, target re-entries, incorrect target acquisitions 
and incorrect clicks were compared for eye gaze and speech when using a gravitational well (ETSG), 
magnification (ETSM) and the absence of both (ETS). These were then compared to the mouse as a pointing 
device. 
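
For reference, throughput in ISO 9241-9 style evaluations is commonly computed from the effective index of difficulty, as described by Soukoreff and MacKenzie (2004). The sketch below uses that common formulation with hypothetical trial data; the function name is illustrative and the exact computation used in this study may differ in detail.

```python
import math
from statistics import mean, stdev


def throughput(distances, deviations, movement_times):
    """Throughput in bits per second for one block of pointing trials.

    distances      -- movement distance of each trial along the task axis
    deviations     -- signed selection error of each trial along the task axis
    movement_times -- time taken for each trial, in seconds
    """
    effective_width = 4.133 * stdev(deviations)  # We = 4.133 * SDx
    effective_distance = mean(distances)         # De
    # Effective index of difficulty: IDe = log2(De / We + 1)
    effective_id = math.log2(effective_distance / effective_width + 1)
    return effective_id / mean(movement_times)


# Hypothetical trial data in pixels and seconds, for illustration only.
distances = [300.0, 310.0, 295.0, 305.0]
deviations = [-6.0, 4.0, 9.0, -7.0]
movement_times = [0.9, 1.1, 1.0, 1.0]

print(f"throughput = {throughput(distances, deviations, movement_times):.2f} bits/s")
```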
The feedback provided to users did not influence the usability of the interaction technique in any way although 
participants preferred using the framed feedback. The inverted colour results in the button being a very dark 
gray which may be too harsh for the majority of the users. A background colour which is not quite so dark may 
be more pleasant for the users. 
The mouse had a significantly higher throughput than all of the eye gaze interaction techniques, but inclusion 
of the gravitational well allowed the eye gaze to perform significantly better than the other eye gaze 
interaction techniques. The mouse throughput was within the expected range as reported by Soukoreff and 
MacKenzie (2004) but all interaction techniques did exhibit improvement over the three sessions.  
The behaviour used to stabilise and select targets in terms of incorrect target acquisitions and target re-entries 
suggests that ETS and ETSM require the most effort to maintain a stable eye gaze. This is contrary to 
expectations since the magnified button should become easier to acquire and maintain due to its larger 
surface area. The high incidence of target re-entries indicates that the eye gaze tends to slip off the target 
frequently even though theoretically the larger surface area should increase the ease with which the target 
can remain selected. Participants also tended to attempt to fine-tune the position when using ETSM possibly 
due to the disturbance experienced by the magnification, the feedback given by the mouse pointer which 
indicates the gaze position may be close to the button or the fact that the larger button is perceived to be 
easier to acquire. In contrast, when using ETS participants tended to focus on another button and then glance 
back to the required target. Therefore, in this instance it is possible that the feedback provided by the 
magnification tool may have altered the behaviour of the user. However, the evidence suggests that it in no 
way increased the efficiency or effectiveness of target selection and therefore it is not recommended for use. 
In response to the questions posed in the questionnaire it would appear that the use of the magnification is 
not a pleasant experience for the participants but no comments were made specifically about the visual 
feedback which was given. 
Incorrect clicks were experienced with all eye gaze interaction techniques although more so with ETSG. 
Participants did, however, improve dramatically with ETSG to a level that was comparable with the other eye 
gaze techniques. Incorrect clicks were most probably caused by the fact that participants have a tendency to 
acquire the next target whilst still issuing the verbal command. Once the verbal command has been processed, 
they have already achieved a stable gaze on the next target, thus causing this target to be selected. This closely 
resembles the behaviour seen when performing an action (Land & Tatler, 2009) or issuing verbal commands in 
an application (Maglio et al., 2000). If this theory holds, the number of missed clicks should be higher for ETSM 
and ETS as the click would occur while users were still attempting to acquire the next target. Unfortunately 
these were not measured and this will be held over for possible further research to confirm the proposed 
theory.  
It was found that the use of a gravitational well significantly improves the usability of eye gaze and speech as a 
pointing device, with this interaction technique being the closest rival to the mouse in terms of usability. While 
magnification is proposed as a means to increase the accuracy of eye gaze selection, it was found, in this case, 
that magnification did not increase the usability of eye gaze and speech as a pointing device at all. 
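The general idea behind a gravitational well can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation used in this study: the function name, the circular shape of the well and the `well_radius` parameter are all hypothetical.

```python
import math


def snap_to_target(gaze, targets, well_radius):
    """Return the centre of the nearest target inside the well, else the raw gaze.

    gaze        -- (x, y) gaze position in pixels
    targets     -- list of (x, y) target centres in pixels
    well_radius -- assumed radius of the well around each centre, in pixels
    """
    # Find the target centre closest to the reported gaze position.
    nearest = min(targets, key=lambda t: math.dist(gaze, t), default=None)
    # Pull the pointer onto that centre only if the gaze falls inside its well.
    if nearest is not None and math.dist(gaze, nearest) <= well_radius:
        return nearest
    return gaze
```

With such a scheme, small amounts of gaze jitter around a button still resolve to the button itself, which is consistent with the lower number of target re-entries that would be expected when a well is active.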
 
9.4.3.2 Usability of speech commands 
How usable are speech commands for performing common word processing tasks? 
The next phase of the study was to test the multimodal interface within Microsoft Word. For this purpose, user 
testing was identified as the most suitable means of exploring the research question. Representative word 
processing tasks which encompassed navigation, formatting and editing were selected for use. Objective 
usability measurements were captured while users completed the tasks using the speech commands and when 
completing them with the traditional interaction techniques.  
The tasks could be divided into generalised types of tasks. The table below contains a summary of the results. 
Where there was a significant difference between the interaction techniques, an “S” indicates that speech was 
significantly better while “K” indicates that the keyboard was significantly better. Where there is a tick mark it 
indicates that both interaction techniques improved significantly over time. 
Table 9.1: Summary of results for speech commands 
 H0,1: Interaction technique H0,2: Session 
 Completion time   Number of actions   Completion time   Number of actions 
Line selection and formatting    S 
Select all and remove S S   
Select words and format K    
Paste S S   
Undo     
Select word and copy     
Position and paste K K   
 
In most instances, speech performed comparably with the keyboard and sometimes even outperformed it for 
both measurements. Selection of words and lines was completed with comparable efficiency and 
effectiveness with the keyboard and with speech. Since selection of a line at a time and the whole document 
were more efficient with the speech or similar to using the keyboard, other selection techniques such as 
selection of a paragraph should be easily accommodated into the grammar and adopted by the users. It was 
only where isolated words had to be navigated to that the speech could not compete with the keyboard.  
This could be due to two reasons. Firstly, users may not have realised that using a command which had an 
immediate effect that was not required could eventually lead to the correct result. Secondly, it was perhaps 
due to the fact that users are not able to navigate efficiently with the keyboard only and therefore could not 
translate the verbal commands into efficient navigation techniques. It may be advisable to first train the users 
in using the keyboard for navigation and then retest them. However, since the word processor enjoys such 
widespread use, the grammar should ideally be as intuitive as possible and not require any training. Therefore, 
for other means of navigation, more intuitive commands may have to be provided. To reduce the chances that 
unwanted connotations may be invoked or that the user will not understand the command, it is advised that a 
study be conducted in which participants are shown how to use the keyboard for navigation and then asked to 
suggest verbal commands for these navigational techniques. 
The results of this study allow the conclusion to be drawn that speech commands can be used effectively and 
efficiently in an editing environment and that the use of a menu-orientated grammar may induce rapid 
learning and use of the grammar. 
 
9.4.3.3 Usability for text entry 
How usable is the combination of eye gaze and speech when used for text entry? 
Text entry was also tested using longitudinal user testing by requiring users to enter phrases from a pre-
selected phrase set using both the keyboard and eye gaze and speech with an onscreen keyboard. Three 
onscreen keyboard configurations were tested, namely large buttons, smaller widely spaced buttons and 
smaller closely spaced buttons. In all instances, the QWERTY keyboard layout was used. 
When comparing the keyboard with the large buttons, the keyboard had significantly fewer errors than the 
eye gaze and speech. However, there was significant improvement between the second and last sessions and 
the third and last sessions. Therefore, the number of errors decreased dramatically as the participants became 
accustomed to the interaction technique. When inspecting the insertions, deletions and substitutions it was 
found that there were significant differences between the insertions and substitutions. The higher incidence of 
insertions and substitutions corresponds closely with the finding of the high number of incorrect clicks with 
this interaction technique. Users were instructed not to delete erroneous characters as this would allow an 
accurate measure of the types of errors that were made. Therefore, the fact that the average number of 
insertions was higher than the average substitutions indicates that the users at least noticed their errors and 
then inserted the correct character. Future studies could allow users to correct their errors and then measure 
both the correctness of the transcribed text as well as the number of corrections required. 
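The classification of errors into insertions, deletions and substitutions can be illustrated with a minimum string distance alignment between the presented and the transcribed phrase. The sketch below is an assumption about how such counts could be obtained and is not the analysis tool used in this study. Where several optimal alignments exist, it reports the counts for only one of them.

```python
def classify_errors(presented, transcribed):
    """Count (insertions, deletions, substitutions) in one optimal alignment."""
    rows, cols = len(presented) + 1, len(transcribed) + 1
    # dist[i][j] = minimum edits turning presented[:i] into transcribed[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if presented[i - 1] == transcribed[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # omitted character
                             dist[i][j - 1] + 1,         # extra character
                             dist[i - 1][j - 1] + cost)  # match or substitution
    # Walk back from the corner, tallying the edit chosen at each step.
    insertions = deletions = substitutions = 0
    i, j = len(presented), len(transcribed)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = 1 if presented[i - 1] != transcribed[j - 1] else 0
            if dist[i][j] == dist[i - 1][j - 1] + mismatch:
                substitutions += mismatch
                i, j = i - 1, j - 1
                continue
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            insertions += 1
            j -= 1
        else:
            deletions += 1
            i -= 1
    return insertions, deletions, substitutions
```

An unintended extra click on a key would appear here as an insertion, while a click that lands on the wrong key would appear as a substitution.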
Text entry using the keyboard was also significantly faster than using eye gaze and speech. There was no 
significant improvement over the sessions for the eye gaze and speech input. 
When comparing the keyboard, large buttons, smaller widely spaced buttons and smaller, closely spaced 
buttons, the keyboard differed significantly from both of the smaller button configurations in terms of the 
number of errors made. In terms of text entry speed, the keyboard was significantly faster than all other input 
configurations. 
 
9.4.3.4 Satisfaction 
The overall reaction to the system was fairly positive and on a comparable level after the first exposure and 
extended exposure. The majority of the participants preferred using the smaller, widely spaced buttons even 
though they did not facilitate a faster typing speed. This preference could be due to the reduced space that is 
occupied by the keyboard or the fact that the smaller buttons resemble standard sized buttons more than the 
others.  Therefore, there could be consequences for the space which is lost to the onscreen widgets required 
for eye gaze interaction although direct questions should be posed to elicit this. 
In terms of the naturalness and satisfaction experienced when using speech commands, most participants felt 
that they were most natural and enjoyed using them. The types of commands which were used were also 
pleasant to the participants although satisfaction was lowest for the navigation using speech commands. This 
corresponds closely with the findings of the objective usability measures. 
Many participants would have preferred having visual as well as audio feedback while typing. The use of visual 
feedback could increase the typing speed as it cannot be definitively stated that participants did not look at 
the document in order to confirm their typing progress. This could have a significant impact on the typing 
speed of users. 
 
9.5 Recommendations 
Eye gaze and speech appears to be a suitable combination for pointing, in particular when a gravitational well 
is activated around the targets. 
Originally, it was assumed that the presence of the visual feedback would be enough incentive for the users to 
keep a steady gaze on the button until the button had been activated. However, it seemed to be a natural 
occurrence for the user to glance at the next target whilst still issuing the command. This resulted in 
significantly more incorrect clicks with the ISO testing and quite possibly a higher error rate with the text entry. 
A recommendation to solve this problem would be to activate the button which has focus at the start of the 
utterance and not at the end of the utterance. This could increase the accuracy as well as the speed with 
which text can be input. Higher speeds can be attained as the user will not have to wait for confirmation 
before proceeding. One drawback of this method is the confirmatory audio beep which was given to alert 
users that the button had successfully been activated. If the user is allowed to look at the next target before 
the click has been executed it means the auditory feedback will have to be given before the successful 
execution of the command. If the auditory feedback is given at the end, then the user may already be focusing 
on the next button and the feedback could cause some confusion. Since users indicated a high preference for 
audio feedback, the usability of no audio feedback, a more premature beep and a delayed beep will have to be 
investigated. 
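One way in which activation at the start of the utterance could be realised is to keep a short history of gaze samples and to look up the button that had focus when the utterance began. The sketch below is hypothetical: the class name, the sample format and the assumption that the speech recogniser reports the start time of an utterance are all illustrative and were not part of the tested application.

```python
from bisect import bisect_right
from collections import deque


class GazeHistory:
    """Keeps recent (timestamp, target) samples so a click can be back-dated."""

    def __init__(self, max_samples=600):
        # Older samples are discarded automatically once the buffer is full.
        self._samples = deque(maxlen=max_samples)

    def record(self, timestamp, target):
        # Samples are assumed to arrive in increasing timestamp order.
        self._samples.append((timestamp, target))

    def target_at(self, timestamp):
        """Return the target in focus at or just before timestamp, or None."""
        times = [t for t, _ in self._samples]
        index = bisect_right(times, timestamp) - 1
        return self._samples[index][1] if index >= 0 else None
```

If the gaze rests on button A when an utterance starts and has moved to button B by the time the command is recognised, `target_at` called with the start time of the utterance returns A, whereas activation at the end of the utterance would select B.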
The speech grammar should be minimised to include only the commands for typing when gaze is detected on 
the keyboard. This should be automatic and could potentially reduce the number of errors incurred since, for 
example, the cursor will not move around because of triggering through ambient noise. In order to allow 
formatting to occur while typing, there should be a mechanism for the user to extend the grammar to its full 
length as well as automatically extending it when the eye gaze leaves the keyboard. Furthermore, speech is 
recommended for use as an input technique to accomplish common word processing tasks. 
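The proposed switching of the grammar could be sketched as shown below. The command names and the `extended_by_user` flag are hypothetical placeholders and do not reflect the actual grammar of the application. The sketch merely illustrates that only typing commands are accepted while the gaze rests on the keyboard, unless the user has explicitly extended the grammar.

```python
# Hypothetical command sets, used for illustration only.
TYPING_COMMANDS = {"type", "space", "backspace"}
EDITING_COMMANDS = {"bold", "copy", "paste", "undo"}


def active_grammar(gaze_on_keyboard, extended_by_user=False):
    """Return the set of commands the recogniser should currently accept."""
    if gaze_on_keyboard and not extended_by_user:
        # Minimised grammar: typing commands only.
        return set(TYPING_COMMANDS)
    # Full grammar: gaze has left the keyboard or the user asked for it.
    return TYPING_COMMANDS | EDITING_COMMANDS


def accept(command, gaze_on_keyboard, extended_by_user=False):
    """Return True if a recognised command should be executed."""
    return command in active_grammar(gaze_on_keyboard, extended_by_user)
```

Under this scheme a spurious recognition of an editing or navigation command caused by ambient noise would be ignored while the user is typing.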
Recommendations for the use of eye gaze and speech for text entry cannot be made at this stage as results 
were inconclusive when comparing objective and subjective measurements. Objective measurements indicate 
less productivity but subjective measurements indicate that users enjoyed using it and would like to use it in 
future. Typing by means of eye gaze and speech is no faster than other eye gaze typing methods, and more 
efficient means, such as word completion algorithms, are needed. However, the method should not be 
summarily disregarded as users may simply require more practice in order to type faster. For example, the 
speeds which can be achieved for typing using a cell phone keyboard are often outstanding and are achieved 
through extended practice. Proper motivation to use the text entry method could increase the speed with 
which it is adopted and used. 
In conclusion, the incorporation of a multimodal interface using eye gaze and speech in a mainstream word 
processor is recommended, as it increases the potential penetration of the application. Moreover, users are 
accustomed to using technology to meet their specific needs and the use of such an interface could increase 
the satisfaction that users have with the application. 
 
9.6 Implications for the future 
As evidenced by the incorporation of the technologies used in the multimodal interface of this study, the time 
has perhaps dawned when they should be exploited as replacement interaction techniques. Speech 
recognition has become a standard feature in personal computers and is often available for dictation 
purposes. Similarly, there are packages available for purchase which can react to spoken commands (cf. 
Dragon, nd). Furthermore, the first fully-integrated eye-controlled laptop has recently been showcased at 
exhibitions (Tobii, 2011) and bodes well for the adoption of eye-tracking as a standard feature in personal 
computers. Cheaper, accurate eye-trackers (cf. Haro et al., 2000) are also available which could function just as 
well as a standard interaction technique. 
Therefore, the fact that a popular mainstream application can be adapted to include a highly customisable, 
multimodal interface could be a step in the right direction for the next generation of interfaces. The 
multimodal user interface displays great potential and test results indicate that the interaction techniques can 
be used for pointing and selecting tasks and common word processing tasks. Moreover, it has been proven 
that speech recognition can indeed be used for editing commands in a word processor which was contrary to 
theoretical beliefs (Klarlund, 2003). This could mean that in the future a more diverse group of users can be 
accommodated and disabled users may no longer have to be relegated to using specialised applications. 
The findings therefore suggest that the word processor is well placed to include such an interface in future 
developments as the technology is rapidly becoming available. As it is foreseen that access to the technologies 
by mainstream users is imminent, future word processors could be developed with multimodal interfaces 
incorporated.   
 
9.7 Further research 
The results of the study unlocked a myriad of possibilities for further research in this area. Firstly, there are a 
number of possibilities for testing the use of eye gaze and speech as a pointing device. For example, eye gaze 
and speech can be compared to dwell time activation to determine if it is comparable or even superior for 
pointing purposes. In order to negate the effects of a possibly slow recognition engine, a Wizard of Oz 
experiment can also be conducted. Speech commands can also be captured and executed based on the gaze 
position prior to when the command was recognised and processed. Furthermore, the use of a double 
command system similar to the touch sensitive mouse can be investigated. Speech commands can also be 
tested in different environments since this study was conducted in a controlled environment which may not be 
indicative of the actual use of such a system. Secondly, since ribbon icons are larger and it has been shown 
that common tasks (which still have smaller icons) can be accommodated in the speech grammar, more complex 
tasks can be tested by expecting users to interact with the ribbon using eye gaze and speech. This will provide 
evidence as to whether a full-length grammar is required or whether the ribbon used in Word 2007 and 
onwards is conducive to eye gaze as a pointing device. 
Thirdly, in terms of speeding up typing using eye gaze and speech there are a number of areas for further 
research. Word completion algorithms can be used to reduce the number of keys which have to be selected. 
Visual feedback can be coupled with the audio feedback which was used in order to confirm that a keyboard 
button has been pressed. Participants can be expected to practise more text entry tasks to determine whether 
the speed of text entry can be increased through protracted practice. Typing free text instead of presented 
text can also be tested to determine whether the use of eye gaze negatively affects the compositional speed of 
the users. 
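As an indication of how word completion could reduce the number of keys which have to be selected, the sketch below suggests the most frequent words that start with the characters typed so far. The function name and the word counts in the usage note are hypothetical. A production system would draw on a far larger language model.

```python
def complete(prefix, word_counts, limit=3):
    """Return up to `limit` words starting with prefix, most frequent first."""
    matches = [word for word in word_counts if word.startswith(prefix)]
    # Sort by descending frequency, breaking ties alphabetically.
    matches.sort(key=lambda word: (-word_counts[word], word))
    return matches[:limit]
```

Given hypothetical counts such as `{"the": 50, "then": 20, "there": 10}`, the prefix "th" would offer "the", "then" and "there", so that a five-letter word could be entered with two key selections and one selection of a suggestion.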
Fourthly, a more diverse user group must be tested on the interface including disabled and aged users. 
The study could also be replicated on a variety of eye-trackers which could have an impact on the results 
achieved. For example, a more accurate eye-tracker with higher precision could be tested. This could have a 
significant effect on the use of eye gaze as a pointing device both with and without the gravitational well. 
Furthermore, some of the cheaper, web-cam based eye-trackers could also be tested for their usability with 
the developed application. 
 
9.8 Summary 
Recent trends have indicated that the time has dawned to move away from the traditional direct manipulation 
interfaces. Non-command, attentive, perceptual and multimodal interfaces present a possible solution for the 
dilemma of providing a more natural and intuitive human-computer interaction. 
Gestures, speech and eye gaze are some of the natural mechanisms which are used by humans during 
communication. These offer a means of improving human-computer interaction. Speech and eye gaze were 
concentrated on in this study to create a multimodal interface for a popular word processor. 
The combination of eye gaze and speech could successfully be used to fulfil the needs of a pointing device, 
particularly when employed with a gravitational well. Furthermore, speech commands could be used to 
facilitate formatting of word processing documents. While text entry was slower than using a keyboard, 
indications are that there was an overall positive response to the interface and that it may well herald a 
suitable multimodal interface. The ease with which participants became accustomed to the interface is further 
proof of the naturalness and intuitiveness provided by speech and eye gaze. With constant progress being 
made in the development of the hardware required by such an interface, the proposed multimodal interface 
may well lay the foundation for a word processor to continue its exploitation of emerging technologies and 
remain a forerunner in the establishment of trends. While there is undoubtedly room for improvement and 
expansion, the use of eye gaze and speech has proven to be very promising. 
 
  
 
 
REFERENCES 
 
Abran, A., Suryn, W., Khelifi, A., Rilling, J. & Seffah, A. (2003). Consolidating the ISO usability models. 
Proceedings of 11th International Software Quality Management Conference, Glasgow, Scotland. 
Accot, J. & Zhai, S. (1999). Performance evaluation of input devices in trajectory-based tasks: An application of 
the steering law. In Proceedings of CHI 99, Pittsburgh, Pennsylvania, United States of America, 466-472. 
Al-Qaimari, G. & McRostie, D. (2001). KALDI: A CAUsE tool for supporting testing and analysis of user 
interaction. In A. Blandford, J. Vanderdonckt and P. Gray (Eds), People and Computers XV – Interaction 
Without Frontiers: Joint proceedings of HCI 2001 and IHM 2001 (pp. 153-169). United Kingdom: Springer. 
Anderson, T. (2009). Pro Office 2007 development with VSTO. United States of America: APress. 
 
Ashdown, M. & Sato, Y. (2005). Attentive interfaces for multiple monitors. CHI 2005 Workshop on Distributed 
Display Environments, Portland, Oregon, United States of America. 
Ashmore, M., Duchowski, A. & Shoemaker, G. (2005). Efficient Eye Pointing with a Fisheye Lens. In Proceedings 
of Graphics Interface 2005, 203-210. 
Atchinson, D.A. & Smith, G. (2000). Optics of the human eye. Oxford: Butterworth-Heinemann. 
Bahill, A.T. & Clark, M.R. (1975). Glissades – Eye Movements Generated by Mismatched Components of the 
Saccadic Motoneuronal Control Signal. Mathematical Biosciences, 26, 303-318. 
Basson, S., Fairweather, P.G. & Hanson, V.L. (2007). Speech recognition and alternative interfaces for older 
users. Interactions, July/August 2007, 26-29. 
Bates, R. (2002). Computer Input Device Selection Methodology for Users with High-Level Spinal Cord Injuries. 
In Proceedings of the 1st Cambridge Workshop on Universal Access and Assistive Technology (CWUAAT), 
25-27 March 2002. Trinity Hall, University of Cambridge. 
Bates, R. & Istance, H. (2002). Zooming interfaces! Enhancing the performance of eye controlled pointing 
devices. In Proceedings of Assets 2002, 119-126. 
Bee, N. & André, E. (2008). Writing with your eye: A dwell time free writing system adapted to the nature of 
human eye gaze. Perception in Multimodal Dialogue Systems, 111-122. Berlin: Springer. 
Beelders, T.R. (2009). Graphics, text and language in a word processor interface. Germany: VDM Verlag. 
Berg, M., Gröber, P. & Weicht, M. (2010). User study: Talking to computers. In Proceedings of the 3rd 
Workshop on Inclusive eLearning, London, United Kingdom, 19-32. 
Bergin, T.J. (2006a). The Origins of Word Processing Software for Personal Computers: 1976-1985. IEEE Annals 
of the History of Computing, 28(4), 32-47. 
Bergin, T.J. (2006b). The Proliferation and Consolidation of Word Processing Software: 1985-1995. IEEE Annals 
of the History of Computing, 28(4), 48-63. 
Bernhaupt, R., Palanque, P., Winkler, M. & Navarre, D. (2007). Usability study of multi-modal interfaces using 
eye-tracking. In Proceedings of INTERACT 2007, 412-424. 
Bevan, N. & Macleod, M. (1994). Usability measurement in context. Behaviour and Information Technology, 
13, 132-145. 
Blignaut, P.J., Dednam, E.H. & Beelders, T.R. (2007). Die opleiding van persone uit benadeelde groepe in 
rekenaargebruik: Is die agterstand nie té groot om te oorbrug nie? (Training of people from 
disadvantaged communities in computer usage: Is the backlog too large to overcome?) Suid-Afrikaanse 
Tydskrif vir Natuurwetenskap en Tegnologie, 26(3), 216-235. 
Bobick, A., Intille, S., Davis, J., Baird, F., Pinhanez, C., Campbell, L., Ivanov, Y., Schutte, A. & Wilson, A. (1999). 
The Kidsroom: A Perceptually-Based Interactive and Immersive Story Environment. Presence: 
Teleoperators and Virtual Environments, 8(4), 367-391. 
Bohmann, K. (2000). User performance metrics. Retrieved 19 October 2005 from: 
http://www.bohmann.dk/articles/user_performance_metrics.html. 
Bolt, R. (1980). “Put-that-there”: Voice and gesture at the graphics interface. Computer Graphics, 14(3), 262-
270. 
Bolt, R. (1981). Gaze-orchestrated dynamic windows. Computer Graphics, 15(3), 109-119. 
Bradley, J.V. (1958). Complete counterbalancing of immediate sequential effects in a Latin square design. 
Journal of the American Statistical Association, 53(282), 525-528. 
Carroll, J.M. (2003). HCI Models, theories, and frameworks: Towards a multidisciplinary science. San Francisco: 
Morgan Kaufmann. 
Castellina, E., Corno, F. & Pellegrino, P. (2008). Integrated Speech and Gaze Control for Realistic Desktop 
Environments. In Proceedings of the 2008 Symposium on Eye Tracking Research and Applications (ETRA), 
79-82. 
Cato, J. (2001). User-centered web design. Great Britain: Addison-Wesley. 
Chin, J.P., Diehl, V.A. & Norman, K.L. (1988). Development of an instrument measuring user satisfaction of the 
human-computer interface. In CHI ’88 Conference Proceedings: Human Factors in Computing Systems, 
213-218, New York: ACM/SIGCHI. 
COGAIN. (2006). An affordable future for eye tracking in sight. Retrieved 29 February 2008 from 
http://www.cogain.org/media/files/COGAIN-IST-Results.pdf. 
Cohen, P.R., Johnston, M., McGee, D., Oviatt, S.L., Clow, J. & Smith, I. (1998). The efficiency of multimodal 
interaction: A case study. In Proceedings of the International Conference on Spoken Language Processing, 
Sydney, Australia, 249-252. 
Cohen, P.R., McGee, D.R. & Clow, J. (2000). The efficiency of multimodal interaction for a map-based task. In 
Proceedings of the Applied Natural Language Processing Conference, 331-338. 
Corno, F., Farinetti, L. & Signorile, I. (2002). A cost-effective solution for eye-gaze assistive technology. In 
Proceedings of the 2002 IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, 
433-436. 
Coutaz, J. & Caelen, J. (1991). A taxonomy for multimedia and multimodal user interfaces. In Proceedings of 
the Second East-West HCI conference, St Petersburg, Russia, 229-240. 
Dai, L., Goldman R., Sears, A. & Lozier, J. (2004). Speech-based cursor control: A study of grid-based solutions. 
In Proceedings of Assets ’04, Atlanta, Georgia, United States of America, 94-101. 
Daintith, J. & Wright, E. (Eds). (2008). A Dictionary of Computing (6th Ed.). New York: Oxford University Press. 
De Luca, A., Weiss, R. & Drewes, H. (2007). Evaluation of eye-gaze interaction methods for security enhanced 
PIN-entry. In Proceedings of OzCHI 2007, Adelaide, Australia, 199-202. 
Désiltes, A., Fox, D.C. & Norton, S. (2006). VoiceCode: An innovative speech interface for programming-by-
voice. In Proceedings of CHI 2006, Montréal, Canada, 239-242. 
Dickinson, A., Gregor, P. & Dickinson, L. (2003). SeeWord: Rethinking interfaces. Insights from word-processing 
software for dyslexic readers. In Proceedings of INTERACT 2003, Zurich, Switzerland, 615-622. 
Dillon, A. (2001). The evaluation of software usability. In W. Karwowski (Ed.), Encyclopedia of Human Factors 
and Ergonomics, (pp. 1-6). London: Taylor and Francis. 
Dirica, A.C. & Göktürk, M. (2009). Attentive Interfaces. In I. Maurtua (Ed.), Human-Computer Interaction, 
InTech. 
Ditchburn, R.W. & Ginsborg, B.I. (1953). Involuntary eye movements during fixation. Journal of Physiology, 
119, 1-17. 
Dix, A., Finlay, J., Abowd, G. & Beale, R. (1993). Human-computer interaction. New Jersey: Prentice-Hall. 
Douglas, S.A., Kirkpatrick, A.E. & MacKenzie, I.S. (1999). Testing pointing device performance and user 
assessment with the ISO 9241, Part 9 Standard. In Proceedings of CHI ‘99, Pennsylvania, United States of 
America, 215-222. 
Dragon Naturally Speaking. (nd). History of speech and voice recognition and transcription software. Retrieved 
13 February 2009 from http://www.nuance.com. 
Drewes, H., De Luca, A. & Schmidt, A. (2007). Eye-gaze interaction for mobile phones. In Proceedings of the 4th 
international conference on mobile technology, applications, and systems and the 1st international 
symposium on Computer human interaction in mobile technology, 364-371. 
Drewes, H. & Schmidt, A. (2007). Interacting with the computer using gaze gestures. In Proceedings of Interact 
'07, Rio De Janeiro, Brazil, 475-488. 
Drewes, H. & Schmidt, A. (2009). The MAGIC touch: Combining MAGIC-pointing with a touch-sensitive mouse. 
In Human-Computer Interaction - INTERACT 2009. 12th IFIP TC 13 International Conference, Part II, 
Uppsala, Sweden, 415-428. 
Duchowski, A.T. (2002). A breadth-first survey of eye tracking applications. Behavior Research Methods, 
Instruments, and Computers, 34(4), 455-470. 
Duchowski, A.T. (2007). Eye tracking methodology: Theory and practice (2nd Edition). London: Springer-Verlag. 
Duchowski, A.T., Cournia, N. & Murphy, H. (2004). Gaze-contingent displays: A review. CyberPsychology and 
Behavior, 7(6), 621-634. 
Dvorak, J.L. (2007). Moving wearables into the mainstream: Taming the Borg. United States: Springer. 
Edwards, A.L. (1951). Balanced Latin-square designs in psychological research. The American Journal of 
Psychology, 64(4), 598-603. 
Eisenberg, D. (1992). Word Processing (History of). In Encyclopedia of Library and Information Science, 49, 268-
278. New York: Dekker. 
Ekman, I., Poikola, A., Mäkäräinen, M., Takala, T. & Hämäläinen, P. (2008). Voluntary pupil size change as 
control in eyes only interaction. In Proceedings of the 2008 Symposium on Eye Tracking Research and 
Applications (ETRA), Savannah, Georgia, United States of America, 115-118. 
Faulkner, C. (1998). The essence of human-computer interaction. Great Britain: Prentice-Hall. 
Fejtová, M., Fejt, J., & Lhotská, L. (2004). Controlling a PC by eye movements: The MEMREC project. In 
Proceedings of the 9th International Conference on Computers Helping People with Special Needs (ICCHP 
‘04), 770-773. Berlin: Springer. 
Felton, E.A., Lewis, N.L., Wills, S.A., Radwin, R.G. & Williams, J.C. (2007). Neural signal based control of the 
Dasher writing system. In Proceedings of the 3rd International IEEE EMBS Conference on Neural 
Engineering, Hawaii, United States of America, 366-370. 
Feng, J. & Sears, A. (2004). Are we speaking slower than we type? Exploring the gap between natural speech, 
typing and speech-based dictation. Accessibility and Computing, 79, 6-9. 
Field, A. (1998). A bluffer’s guide to ... sphericity. The British Psychological Society: Mathematical, Statistical 
and Computing Section Newsletter, 6, 13-22. 
Fitch, W.T. (2000). The evolution of speech: A comparative review. Trends in Cognitive Science, 4(7), 258-267. 
Fitts, P.M. (1954). The information capacity of the human motor system in controlling the amplitude of 
movement. Journal of Experimental Psychology, 47(6), 381-391. 
Foley, J.D., Van Dam, A., Feiner, S.K. & Hughes, J.F. (1990). Computer graphics: Principles and practice. Reading, 
Massachusetts: Addison-Wesley. 
Forlines, C., Schmidt-Nielsen, B., Raj, B., Wittenburg, K. & Wolf, P. (2005). A comparison between spoken 
queries and menu-based interfaces for in-car digital music selection. In Proceedings of International 
Conference on Human-Computer Interaction (INTERACT 2005), 12-16. 
Forsberg, M. (2003). Why is speech recognition difficult? Technical Report, Chalmers University of Technology. 
Freedman, A. (1998). The Computer Glossary (8th Edition). United States of America: AMACOM. 
Freudenthal, A., Keyson, D.V., DeKoven, E. & De Hoogh, M.P.A.J. (2001). Communicating extensive smart home 
functionality to users of all ages: the design of a mixed-initiative multimodal thermostat interface. In 
OIKOS 2001 Workshop: Methodological Issues in the Design of Household Technologies, Molslaboratoriet, 
Denmark, 34-39. 
Fry, E.B., Kress, J.E. & Fountoukidis, D.L. (2003). The reading teacher’s book of lists. United States of America: 
Center for Applied Research in Education. 
Furnas, G.W. (1986). Generalized fisheye views. In Proceedings of the SIGCHI conference on Human factors in 
computing systems, Boston, United States of America, 16-23. 
Gajos, K.Z., Wobbrock, J.O. & Weld, D.S. (2008). Improving the Performance of Motor-Impaired Users with 
Automatically-Generated, Ability-Based Interfaces. In Proceedings of CHI 2008, Florence, Italy, 1257-1266. 
Gips, J. & Olivieri, P. (1996). EagleEyes: An eye control system for persons with disabilities. In Proceedings of 
the 11th International Conference on Technology and Persons with Disabilities, Los Angeles, United States 
of America. 
Girden, E. R. (1992). ANOVA: Repeated measures. Newbury Park, California: Sage. 
Glenstrup, A.J. & Engell-Nielsen, T. (1995). Eye controlled media: Present and future state. Bachelor’s Degree 
Thesis, University of Copenhagen. 
Gorniak, P. & Roy, D. (2003). Augmenting user interfaces with adaptive speech commands. In Proceedings of 
ICMI ’03, Vancouver, Canada, 176-179. 
Gregory, R.L. (1966). The eye and the brain: The psychology of seeing. London: World University Library. 
Griffin, Z. (2001). Gaze durations during speech reflect word selection and phonological encoding. Cognition, 
82, B1–B14. 
Griffin, Z. M. & Bock, K. (2000). What the eyes say about speaking. Psychological Science, 11, 274–279. 
 
Haigh, T. (2006). Remembering the office of the future: The origins of word processing and office automation. 
IEEE Annals of the History of Computing, 28(4), 6-31. 
Haller, R., Mutschler, H. & Voss, M. (1984). Comparison of input devices for correction of typing errors in office 
systems. Proceedings of INTERACT '84, First IFIP Conference on Human-Computer Interaction, London, 
United Kingdom, 177-182. 
Hansen, J.P., Hansen, D.W., & Johansen, A.S. (2001). Bringing gaze-based interaction back to basics. In C. 
Stephanidis (Ed.) Universal Access in HCI (UAHCI): Towards an Information Society for All - Proceedings of 
the 9th International Conference on Human-Computer Interaction (HCII‘01), 325-328. Mahwah, NJ: 
Lawrence Erlbaum Associates.  
Hansen, D.W. & Ji, Q. (2010). In the eye of the beholder: A survey of models for eyes and gaze. IEEE 
Transactions on Pattern Analysis and Machine Intelligence, 32(3), 478-500. 
Hansen, J.P., Johansen, A.S., Hansen, D.W., Itoh, K. & Mashino, S. (2003). Command without a click: Dwell time 
typing by mouse and gaze selections. In Proceedings of Human-Computer Interaction (INTERACT ’03), 
Zurich, Switzerland, 121-128. 
Haro, A., Essa, I. & Flickner, M. (2000). A non-invasive computer vision system for reliable eye tracking. In 
Proceedings of CHI ’00, The Hague, Netherlands, 167-168. 
Harper, B.D. & Norman, K.L. (1993). Improving user satisfaction: The questionnaire for user interaction 
satisfaction version 5.5. In Proceedings of the 1st annual Mid-Atlantic Human Factors Conference, Virginia 
Beach, Virginia, United States of America, 224-228. 
Hassenzahl, M. & Tractinsky, N. (2006). User experience – a research agenda. Behaviour and Information 
Technology, 25(2), 91-97. 
Hatfield, F. & Jenkins, E.A. (1997). An interface integrating eye gaze and voice recognition for hands-free 
computer access. In Proceedings of the CSUN 1997 Conference, 1-7. 
Hauptmann, A.G. (1989). Speech and Gestures for Graphic Image Manipulation. In Proceedings of the 
International Conference on Human-Computer Interaction, 241-245. 
He, T. & Kaufman, A.E. (1997). Virtual input devices for 3D systems. In Proceedings of IEEE Visualization '93, 
San Jose, California, 142-148. 
Holmqvist, K., Nyström, M., Andersson, R., Dewhurst, R., Jarodzka, H. & Van de Weijer, J. (In press). Eye 
tracking: A comprehensive guide to methods and measures. London: Oxford University Press. 
Hornof, A., Cavender, A. & Hoselton, R. (2004). EyeDraw: A system for drawing pictures with eye movements. 
In Proceedings of ASSETS ’04, Atlanta, Georgia, United States of America, 86-93. 
Huckauf, A. & Urbina, M.H. (2008). Gazing with pEyes: Towards a universal Input. In Proceedings of the 2008 
Symposium on Eye Tracking Research and Applications (ETRA), Savannah, Georgia, United States of 
America, 51-54. 
Hwang, F., Keates, S., Langdon, P. & Clarkson, J. (2004). Mouse movements of motion-impaired users: A 
submovement analysis. In Proceedings of ASSETS ’04, Atlanta, Georgia, United States of America, 102-109. 
Hyrskykari, A. (1997). Gaze Control as an Input Device. In Proceedings of ACHCI ’97, University of Tampere, 22-27. 
Hyrskykari, A., Majaranta, P. & Räihä, K-J. (2003). Proactive response to eye movements. In Proceedings of 
INTERACT '03, Zurich, Switzerland, 129-136. 
ISO9241. (1998). ISO 9241-11: Guidance on usability. International Organization for Standardization. 
ISO. (2000). ISO 9241-9: Ergonomic requirements for office work with visual display terminals (VDTs) – Part 9: 
Requirements for non-keyboard input devices. International Organization for Standardization. 
ISO9241-210:2010. (2010). Ergonomics of human-system interaction – Part 210: Human-centred design for 
interactive systems. International Organization for Standardization. 
Isokoski, P. (2000). Text input methods for eye trackers using off-screen targets. In Proceedings of the 2000 
Symposium on Eye Tracking Research and Applications (ETRA), Palm Beach Gardens, Florida, United 
States of America, 15-21. 
Istance, H.O., Spinner, C. & Howarth, P.A. (1996). Providing motor impaired users with access to standard 
Graphical User Interface (GUI) software via eye-based interaction. In Proceedings of the 1st European 
Conference on Disability, Virtual Reality and Associated Technology, Maidenhead, United Kingdom, 109-116. 
Jacob, R.J.K. (1991). The use of eye movements in human-computer interaction techniques: What you look at 
is what you get. ACM Transactions on Information Systems, 9(2), 152-169. 
Jacob, R.J.K. (1993a). Eye Movement-Based Human-Computer Interaction Techniques: Toward Non-Command 
Interfaces. In H.R. Hartson and D. Hix (Eds), Advances in Human-Computer Interaction, 4, 151-190. 
Norwood, New Jersey: Ablex Publishing. 
Jacob, R.J.K. (1993b). What you look at is what you get: Using eye movements as computer input. In 
Proceedings of Virtual Reality Systems '93 conference and exhibition, New York, New York, United States 
of America, 164-166. 
Jacob, R.J.K. (1995a). Eye tracking in advanced interface design. In W. Barfield & T.A. Furness (Eds.), Virtual 
Environments and Advanced Interface Design (pp. 258-288). New York: Oxford University Press. 
Jacob, R.J.K. (1995b). Natural Dialogue in Modes other than Natural Language. In R.J. Beun, M. Baker & M. 
Reiner (Eds), Dialogue and Instruction (pp. 289-301). Berlin: Springer-Verlag. 
Jacob, R.J.K. & Karn, K.S. (2003). Eye tracking in human-computer interaction and usability research: Ready to 
deliver the promises (Section Commentary). In J. Hyona, R. Radach & H. Deubel (Eds) The Mind's Eye: 
Cognitive and Applied Aspects of Eye Movement Research (pp. 573-605). Amsterdam: Elsevier Science. 
Jaimes, A. & Sebe, N. (2005). Multimodal human computer interaction: A survey. IEEE workshop on human 
computer interaction, Las Vegas, Nevada, United States of America, 15-21. 
Jönsson, E. (2005). If looks could kill – An evaluation of eye tracking in computer games. Master’s Thesis, Royal 
Institute of Technology, Stockholm, Sweden. 
Jurafsky, D. & Martin, J.H. (2000). Speech and language processing: An introduction to Natural Language Processing, 
Computational Linguistics and Speech recognition. New Jersey: Prentice Hall. 
Just, M.A. & Carpenter, P.A. (1976). Eye fixations and cognitive processes. Cognitive Psychology, 8, 441-480. 
Kammerer, Y., Scheiter, K. & Beinhauer, W. (2008). Looking my way through the menu: The impact of menu 
design and multimodal input on gaze-based menu selection. In Proceedings of the Symposium on Eye 
Tracking Research and Applications (ETRA), Savannah, Georgia, United States of America, 213-220. 
Karimullah, A.S. & Sears, A. (2002). Speech-based cursor control. In Proceedings of ASSETS ’02, Edinburgh, 
Scotland, 178-185. 
Karl, L., Pettey, M. & Shneiderman, B. (1993). Speech-activated versus mouse-activated commands for word 
processing applications: An empirical evaluation. International Journal of Man-Machine Studies, 39, 667-687. 
Karpov, A., Carbini, S., Ronzhin, A. & Viallet, J.E. (2008). Comparison of two different similar speech and 
gestures multimodal interfaces. In Proceedings 16th European Signal Processing Conference, Lausanne, 
Switzerland. 
Kaukènas, J., Navickas, G. & Telksnys, L. (2006). Human-computer audiovisual interface. Information technology 
and control, 35(2), 87-93. 
Kaur, M., Tremaine, M., Huang, N., Wilder, J., Gacovski, Z., Flippo, F. & Mantravadi, S. (2003). Where is “it”? 
Event synchronization in gaze-speech input systems. In Proceedings of ICMI ’03, Vancouver, Canada, 151-158. 
Keates, S., Hwang, F., Langdon, P., Clarkson, P.J. & Robinson, P. (2002). Cursor movements for motion-impaired 
computer users. In Proceedings of ASSETS ’02, Edinburgh, Scotland, 135-142. 
Keates, S. & Trewin, S. (2005). Effect of age and Parkinson’s Disease on cursor positioning using a mouse. In 
Proceedings of ASSETS ’05, Baltimore, Maryland, United States of America, 68-75. 
Klarlund, N. (2003). Editing by Voice and the Role of Sequential Symbol Systems for Improved Human-to-
Computer Information Rates. In Proceedings of ICASSP, Hong Kong, 553-556. 
Klarlund, N. & Riley, M. (2003). Word n-grams for cluster keyboards. In Proceedings of the 11th Conference of 
the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, 51-58. 
Kumar, M. (2006). Reducing the cost of eye tracking systems. Technical Report CSTR 2006-08, Stanford HCI 
Group. 
Kumar, M. (2007). GUIDe Saccade Detection and Smoothing Algorithm. Technical Report CSTR 2007-03, 
Stanford HCI Group.  
Kumar, M., Klinger, J., Puranik, R., Winograd, T. & Paepcke, A. (2008). Improving the accuracy of gaze input for 
interaction. In Proceedings of the Symposium on Eye Tracking Research and Applications (ETRA), 
Savannah, Georgia, United States of America, 65-68. 
Kumar, M., Paepcke, A. & Winograd, T. (2007). EyePoint: Practical pointing and selection using gaze and 
keyboard. In Proceedings of CHI 2007, San Jose, California, United States of America, 421-430. 
Kumar, M. & Winograd, T. (2007). GUIDe: Gaze-enhanced UI Design. In Proceedings of CHI 2007, San Jose, 
California, United States of America, 1977-1982. 
Land, M.F. & Tatler, B.W. (2009). Looking and acting: Vision and eye movements in natural behaviour. United 
States of America: Oxford University Press. 
Laqua, A., Bandara, S.U. & Sasse, M.A. (2007). GazeSpace: Eye gaze controlled content spaces. In Proceedings 
of HCI 2007, Beijing, China, 55-58. 
Latoschik, M.E., Fröhlich, M., Jung, B. & Wachsmuth, I. (1998). Utilize speech and gestures to realize natural 
interaction in a virtual environment. In IECON’98 – Proceedings of the 24th Annual Conference of the IEEE 
Industrial Electronics Society, 2028–2033. 
Leggett, J. & Williams, G. (1984). An empirical investigation of voice as an input modality for computer 
programming. International Journal of Man-Machine Studies, 21(6), 493-520.  
Levenshtein, V.I. (1965). Binary codes capable of correcting deletions, insertions, and reversals. Doklady 
Akademii Nauk, 163, 845-848. 
Li, D., Winfield, D. & Parkhurst, D.J. (2005). Starbursts: A hybrid algorithm for video-based eye tracking 
combining feature-based and model-based approaches. In Proceedings of the IEEE Vision for Human-
Computer Interaction Workshop at CVPR, Beijing, China, 1-8.  
Liu, Y., Chai, J.Y. & Jin, R. (2007). Automated vocabulary acquisition and interpretation in multimodal 
conversational systems. In Proceedings of the 45th Annual Meeting of the Association for Computational 
Linguistics. 
Liu, X., Crump, M.J.C. & Logan, G.D. (2010). Do you know where your fingers have been? Explicit knowledge of 
the spatial layout of the keyboard in skilled typists. Memory and Cognition, 28(4), 474-484. 
Logan, G.D. & Crump, M.J.C. (2009). The left hand doesn’t know what the right hand is doing: The disruptive 
effects of attention to the hands in skilled typewriting. Psychological Science, 20(10), 1296-1300. 
MacKenzie, I.S. (2002). A note on calculating text entry speed. Retrieved 14 June 2010 from 
http://www.yorku.ca/mack/RN-TextEntrySpeed.html. 
MacKenzie, I.S., Kauppinen, T. & Silfverberg, M. (2001). Accuracy measures for evaluating computer pointing 
devices. In Proceedings of SIGCHI ’01, Seattle, Washington, United States of America, 9-16. 
MacKenzie, I.S. & Soukoreff, R.W. (2002). A character-level error analysis technique for evaluating text entry 
methods. In Proceedings of NordiCHI 2002, Aarhus, Denmark, 243-246. 
MacKenzie, I.S. & Soukoreff, R.W. (2003). Phrase sets for evaluating text entry techniques. In Extended 
Abstracts of the ACM Conference on Human Factors in Computing Systems – CHI 2003, Fort Lauderdale, 
Florida, United States of America, 754-755. 
Maglio, P.P., Matlock, T., Campbell, C.S., Zhai, S. & Smith, B.A. (2000). Gaze and speech in attentive user 
interfaces. In Proceedings of the Third International Conference on Advances in Multimodal Interfaces, 
Vancouver, Canada, 1-7. 
Majaranta, P. (2009). Text entry by eye gaze. Dissertations in Interactive Technology, number 11, University of 
Tampere. 
Majaranta, P., Ahola, U-K. & Špakov, O. (2009). Fast gaze typing with an adjustable dwell time. In Proceedings 
of the 27th International Conference on Human Factors in Computing Systems, Boston, Massachusetts, 
United States of America, 357-360. 
Majaranta, P., MacKenzie, I.S., Aula, A. & Räihä, K-J. (2006). Effects of dwell time on eye typing and accuracy. 
Universal Access in the Information Society, 5(2), 199-208. 
Majaranta, P. & Räihä, K.-J. (2007). Text entry by gaze: Utilizing eyetracking. In I. S. MacKenzie and K. Tanaka-
Ishii (Eds.) Text Entry Systems: Mobility, Accessibility, Universality, 175-187. San Francisco: Morgan 
Kaufmann. 
Man, D.W.K. & Wong, M-S.L. (2007). Evaluation of computer-access solutions for students with quadriplegic 
athetoid cerebral palsy. American Journal of Occupational Therapy, 61, 355-364. 
Martinez-Conde, S. & Macknik, S.L. (2008). Fixational eye movements across vertebrates: Comparative 
dynamics, physiology, and perception. Journal of Vision, 8(14), 1-16. 
Martinez-Conde, S., Macknik, S.L. & Hubel, D.H. (2004). The role of fixational eye movements in visual 
perception. Nature Reviews Neuroscience, 5(3), 229-240. 
Maxwell, S.E. & Delaney, H.D. (2004). Designing experiments and analyzing data: A model comparison 
perspective (2nd ed.). Mahwah, New Jersey: Lawrence Erlbaum Associates. 
Microsoft (nd). Microsoft Speech API. Retrieved 4 May 2010 from 
http://msdn.microsoft.com/en-us/library/ms723627(VS.85).aspx. 
Minke, A. (1997). Conducting repeated measures analyses: Experimental design considerations. In Proceedings 
of the Annual Meeting of the Southwest Educational Research Association, Austin, Texas, 23-25. 
Miniotas, D. (2000). Application of Fitts’ Law to eye gaze interaction. In Proceedings of CHI 2000, The Hague, 
Netherlands, 339-340. 
Miniotas, D. & Špakov, O. (2004). Target expansion as a means to facilitate eye-based selection. Elektronika Ir 
Elektrotechnika, 3(25), 13-17. 
Miniotas, D., Špakov, O. & Evreinov, G. (2003). Symbol Creator: An alternative eye-based text entry technique 
with low demand for screen space. In Proceedings of Human Computer Interaction – INTERACT ’03, 
Zurich, Switzerland, 137-143. 
Miniotas, D., Špakov, O. & MacKenzie, I.S. (2004). Eye gaze interaction with expanding targets. In Extended 
abstracts of the ACM Conference of Human Factors in Computing Systems – CHI 2004, Vienna, Austria, 
1255-1258. 
Miniotas, D., Špakov, O., Tugoy, I. & MacKenzie, I.S. (2006). Speech-Augmented Eye Gaze Interaction with 
Small Closely Spaced Targets. In Proceedings of the 2006 Symposium on Eye Tracking Research and 
Applications (ETRA), 67-72. 
Morimoto, C.H. & Amir, A. (2010). Context switching for fast key selection in text entry applications. In Proceedings of 
the 2010 Symposium on Eye Tracking Research and Applications (ETRA), 271-274. 
Morrison, D.L., Green, T.R.G., Shaw, A.C. & Payne, S.J. (1984). Speech-controlled text-editing: effects of input 
modality and of command structure. International Journal of Man-Machine Studies, 21, 49-63.  
Motulsky, H. (1995). Intuitive biostatistics: Choosing a statistical test. United States of America: Oxford University 
Press. 
Murata, A. (2006). Eye-Gaze Input Versus Mouse: Cursor Control as a Function of Age. International Journal of 
Human-Computer Interaction, 21(1), 1-14. 
Natapov, D., Castellucci, S.J. & MacKenzie, I.S. (2009). ISO 9241-9 evaluation of video game controllers. In 
Proceedings of Graphics Interface Conference, Kelowna, British Columbia, Canada, 223-230. 
Nelson, D.L. (1986). User acceptance of voice recognition in a product inspection environment. The Official 
Proceedings of Speech Tech ’86: Voice Input / Output Applications Show and Conference, p. 62. 
Nielsen, J. (1993). Noncommand user interfaces. Retrieved 11 March 2011 from 
http://www.useit.com/papers/noncommand.html. 
Nielsen, J. (2000). Why you only need to test with 5 users. Alertbox, 19 March, 2000. Retrieved 7 June 2010 
from http://www.useit.com/alertbox/20000319.html. 
Nielsen, J. (2001a). Usability metrics. Alertbox, January, 2001. Retrieved 7 June 2010 from 
http://www.useit.com/alertbox/20010121.html. 
Nielsen, J. (2001b). Success rate: The simplest usability metric. Alertbox, February, 2001. Retrieved 7 June 2010 
from http://www.useit.com/alertbox/20010218.html. 
Nielsen, J. (2006). Quantitative Studies: How Many Users to Test? Alertbox, 26 June, 2006. Retrieved 7 June 
2010 from http://www.useit.com/alertbox/quantitative_testing.html. 
Nijholt, A. & Tan, D. (2008). Brain-computer interfacing for intelligent systems. IEEE Intelligent Systems, 23(3), 
72-79. 
Nimon, K. & Williams, C. (2009). Evaluating performance improvement through repeated measures: A primer 
for educators considering univariate and multivariate designs. Research in Higher Education Journal, 2, 
28-48. 
Nusbaum, H.C., DeGroot, J. & Lee, L. (1995). Using speech recognition systems: Issues in cognitive engineering. 
In A. Syrdal, R. Bennett and S. Greenspan (Eds), Applied Speech Technology (pp. 127-194). Boca Raton, 
Florida: CRC Press. 
Nye, J. M. (1982). Human factors analysis of speech recognition systems. Speech Technology, 1(2), 50-57.  
Olivier, M. (2004). Information technology research: A practical guide for Computer Science and Informatics 
(2nd Edition). Pretoria: Van Schaik. 
O’Shaughnessy, D. (1995). Speech Technology. In A. Syrdal, R. Bennett and S. Greenspan (Eds), Applied Speech 
Technology (pp. 47-98). Boca Raton, Florida: CRC Press. 
Oviatt, S. (1999). Mutual disambiguation of recognition errors in a multimodal architecture. In Proceedings of 
the ACM SIGCHI 99, Pittsburgh, Pennsylvania, United States of America, 576-583.  
Oviatt, S. & Cohen, P. (2000). Multimodal interfaces that process what comes naturally. Communications of the 
ACM, 43(2), 45-53. 
Oviatt, S., Cohen, P., Wu, L.Z., Vergo, J., Duncan, L., Suhm, B., Bers, J., Holzman, T., Winograd, T., Landay, J., 
Larson, J. & Ferro, D. (2000). Designing the user interface for multimodal speech and pen-based gesture 
applications: State-of-the-art systems and future research directions. Human-Computer Interaction, 15(4), 
263-322. 
Oviatt, S., MacEachern, M. & Levow, G. (1998). Predicting hyperarticulate speech during human-computer 
error resolution. Speech Communication, 24(2), 87-110. 
Oxford Dictionaries. (2011). Oxford Dictionaries. England: Oxford University Press. Last accessed 14 July 2011 
at http://oxforddictionaries.com. 
Oyekoya, O.K. & Stentiford, F.W.M. (2006). Eye tracking – A new interface for visual exploration. BT 
Technology Journal, 24(3), 57-66. 
Paluch, K. (2009). What is user experience design. Last accessed 23 July 2011 at 
http://www.montparnas.com/articles/what-is-user-experience-deisgn/print. 
Pireddu, A. (2007). Multimodal Interaction: An integrated speech and gaze approach. Thesis, Politecnico di 
Torino. 
Poock, G. K. (1982). Voice recognition boosts command terminal throughput. Speech Technology, 1(2), 36-39.  
Porta, M. & Turina, M. (2008). Eye-S: a full-screen input modality for pure eye-based communication. In 
Proceedings of the Symposium on Eye Tracking Research and Applications (ETRA), Savannah, Georgia, 
United States of America, 27-34. 
Poulton, E.C. & Freeman, P.R. (1966). Unwanted asymmetrical transfer effects with balanced experimental 
designs. Psychological Bulletin, 66, 1-8. 
Prasov, Z., Chai, J.Y. & Jeong, H. (2007). Eye gaze for attention prediction in multimodal human-machine 
conversation. In Proceedings of AAAI Spring Symposium on Interaction Challenges for Intelligent 
Assistants. 
Preece, J., Rogers, Y., Sharp, H., Benyon, D., Holland, S. & Carey, T. (1994). Human-computer interaction. 
England: Addison-Wesley. 
Quadriplegic Association of South Africa. (nd). Available from http://quad.stormnet.co.za/index.htm.  
Qvarfordt, P., Beymer, D. & Zhai, S. (2005). RealTourist – A Study of Augmenting Human-Human and Human-
Computer Dialogue with Eye-Gaze Overlay. In Proceedings of INTERACT 2005, Rome, Italy, 767-780. 
Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological 
Bulletin, 124(3), 372-422. 
Read, J. (2005). On the application of text input metrics to handwritten text input. Text Input Workshop, 
Dagstuhl, Germany. 
Read, J., MacFarlane, S. & Casey, C. (2001). Measuring the usability of text input methods for children. In 
Proceedings of Human-Computer Interaction (HCI) 2001, New Orleans, United States of America, 559-572. 
Reese, H.W. (1997). Counterbalancing and other uses of repeated-measures Latin-square designs: Analyses 
and interpretations. Journal of Experimental Child Psychology, 64, 137-158. 
Rosson, M.B. (1984). Effects of experience on learning, using, and evaluating a text editor. Human Factors, 
26(4), 463-475. 
Rubinoff, R. (nd). How to quantify the user experience. Last accessed 24 July 2011 at 
http://www.sitepoint.com/quantify-user-experience. 
Russell, S.J. & Norvig, P. (2009). Artificial Intelligence: A modern approach (3rd Edition). Prentice Hall. 
Schmandt, C., Ackerman, M.S. & Hindus, D. (1990). Augmenting a window system with speech input. 
Computer, 23(8), 50-56. 
Schnell, T. (2000). Applying eye tracking as an alternative approach for activation of controls and functions in 
aircraft. In Proceedings of Digital Avionics Systems Conferences, Washington, DC, United States of 
America, 19(2), 5A5/1-5A5/9. 
Scholtz, J. (2004). Usability evaluation. Publication #545, National Institute of Standards and Technology. 
Sears, A., Karat, C-M., Oseitutu, K., Karimullah, A. & Feng, J. (2001). Productivity, satisfaction, and interaction 
strategies of individuals with spinal cord injuries and traditional users interacting with speech recognition 
software. Universal Access in the Information Society, 1(4), 4-15. 
Sears, A., Lin, M. & Karimullah, A.S. (2002). Speech-based cursor control: Understanding the effects of target 
size, cursor speed, and command selection. Universal Access in the Information Society, 2(1), 30-43. 
Shackel, B. (1991). Usability – Context, framework, design and evaluation. In B. Shackel & S. Richardson (Eds), 
Human factors for informatics usability (pp. 21-38). Cambridge: Cambridge University Press. 
Shell, J.S., Bradbury, J.S., Knowles, C.B., Dickie, C. & Vertegaal, R. (2003a). eyeCook: A gaze and speech enabled 
attentive cookbook. In Video Proceedings of Ubiquitous Computing (Ubicomp), Seattle, Washington, 
United States of America. 
Shell, J.S., Vertegaal, R., Mamuji, A., Pham, T., Sohn, C. & Skaburskis, A. (2003b). EyePliances and 
EyeReason: Using Attention to Drive Interactions with Ubiquitous Appliances. In Extended Abstracts of 
UIST, Vancouver, Canada. 
Shneiderman, B. (1998). Designing the user interface: Strategies for effective human-computer interaction (3rd 
Edition). Massachusetts: Addison-Wesley. 
Shneiderman, B. (2000). The limits of speech recognition. Communications of the ACM, 43(9), 63-65. 
Sibert, L.E. & Jacob, R.J.K. (2000). Evaluation of Eye Gaze Interaction. In Proceedings of ACM CHI 2000: Human 
Factors in Computing Systems Conference, The Hague, Netherlands, 281-288. 
Simon, C. (2002). Com objects, C# and the Microsoft speech API. Dr Dobb’s Journal, September 2002. 
Smith, B.A., Ho, J., Ark, W. & Zhai, S. (2000). Hand eye coordination patterns in target selection. In Proceedings 
of the Eye Tracking Research and Application Symposium (ETRA), Palm Beach Gardens, Florida, United 
States of America, 117-122. 
Soukoreff, R. W. & MacKenzie, I. S. (2001). Measuring errors in text entry tasks: An application of the 
Levenshtein string distance statistic. In Extended Abstracts of the ACM Conference on Human Factors in 
Computing Systems (CHI ’01), Seattle, Washington, United States of America, 319-320. 
Soukoreff, R.W. & MacKenzie, I.S. (2004). Towards a standard for pointing device evaluation, perspectives on 
27 years of Fitts’ Law research in HCI. International Journal of Human-Computer Studies, 61, 751-789. 
Špakov, O. (2005). EyeChess: The tutoring game with visual attentive interface. In Proceedings of Alternative 
Access: Feelings & Games, University of Tampere, Finland, 81-86. 
Špakov, O. & Majaranta, P. (2008). Scrollable keyboards for eye typing. In Proceedings of the 4th Conference on 
Communication by Gaze Interaction (COGAIN), Prague, Czech Republic, 63-66. 
Špakov, O. & Miniotas, D. (2003). An algorithm for adjustable dwell time in eye typing systems. Information 
Technology and Control, 2(31), 49-52. 
Špakov, O. & Miniotas, D. (2005). Gaze-based selection of standard-size menu items. In Proceedings of the 7th 
International Conference on Multimodal Interfaces (ICMI), Trento, Italy, 124-128. 
Stampe, D.M. & Reingold, E.M. (1995). Selection by looking: A novel computer interface and its application to 
psychological research. In J.M. Findlay, R. Walker & R.W. Kentridge (Eds), Eye movement research: 
Mechanisms, processes and applications (pp. 467-478). Amsterdam: Elsevier Science Publishers. 
StatSoft, Inc. (2010). Electronic Statistics Textbook. Last accessed 21 May 2011 at 
http://www.statsoft.com/textbook/. 
Stiefelhagen, R. & Yang, J. (1997). Gaze Tracking for Multimodal Human-Computer Interaction. In Proceedings 
of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Munich, 
Germany, 140-147. 
Su, M-C., Su, S-Y. & Chen, G-D. (2005). A low-cost vision-based human-computer interface for people with 
severe disabilities. Biomedical Engineering: Applications, Basis and Communications, 17(6), 284-292. 
Suhm, B. (2008). IVR usability engineering using guidelines and analyses of end-to-end calls. In D. Gardner-
Bonneau and H.E. Blanchard (Eds), Human factors and voice interactive systems (pp. 1-42). New York, NY: 
Springer Science+Business Media. 
Tan, Y.K., Sherkat, N. & Allen, T. (2003a). Eye gaze and speech for data entry: A comparison of different data 
entry methods. In Proceedings of the International Conference on Multimedia and Expo, Baltimore, 
Maryland, United States of America, 41-44. 
Tan, Y.K., Sherkat, N. & Allen, T. (2003b). Error recovery in a blended style eye gaze and speech interface. In 
Proceedings of ICMI ’03, Vancouver, Canada, 196-202. 
Tanaka, K. (1999). A robust selection system using realtime multi-modal user-agent interactions. In 
Proceedings of IUI’99, 105-108. 
Tanenhaus, M. K., Spivey-Knowlton, M., Eberhard, K. & Sedivy, J. (1995). Integration of visual and linguistic 
information during spoken language comprehension. Science, 268, 1632-1634. 
Ten Kate, J.H., Frietman, E.E.E., Willems, W., Ter Haar Romeny, B.M., & Tenkink, E. (1979). Eye-switch 
controlled communication aids. In Proceedings of the 12th International Conference on Medical and 
Biological Engineering, Jerusalem, Israel, 19-20. 
Thomas, J.C., Basson, S. & Gardner-Bonneau, D. (2008). Accessibility and speech technology: Advancing toward 
universal access. In D. Gardner-Bonneau and H.E. Blanchard (Eds), Human factors and voice interactive 
systems (pp. 417-442). New York, NY: Springer Science+Business Media. 
Tobii. (2011). Tobii unveils the world’s first eye-controlled laptop. Retrieved 14 March 2011 from 
http://www.tobii.com/en/eye-tracking-integration/global/news-and-events/press-releases/tobii-unveils-the-worlds-first-eye-controlled-laptop/. 
Tse, E., Greenberg, S., Shen, C. & Forlines, C. (2006). Multimodal multiplayer tabletop gaming. In Proceedings 
of PerGames, Dublin, Ireland, 139-148. 
Tuisku, O., Majaranta, P., Isokoski, P. & Räihä, K-J. (2008). In Proceedings of the Symposium on Eye Tracking 
Research and Applications (ETRA), Savannah, Georgia, United States of America, 19-26. 
Tullis, T. & Albert, B. (2008). Measuring the user experience. United States of America: Morgan Kaufmann 
Publishers. 
Tullis, T.S. & Stetson, J.N. (2004). A comparison of questionnaires for assessing website usability. In Connecting 
Communities: UPA, Network In Our Community, Minneapolis, Minnesota, United States of America. 
Turk, M. (2001). Perceptual user interfaces. In R. Earnshaw, R. Guedj, A. van Dam & J. Vince (Eds), Frontiers of 
human-centred computing, online communities and virtual environments (pp. 39-51). London: Springer-
Verlag. 
Turk, M. & Kölsch, M. (2004). Perceptual Interfaces. In G. Medioni and S.B. Kang (Eds), Emerging Topics in 
Computer Vision (pp. 358-403). Prentice Hall. 
Unger, R. & Chandler, C. (2009). A project guide to UX design: For user experience designers in the field or in 
the making. United States of America: New Riders Press. 
Van Dam, A. (2001). Post-Wimp user interfaces: The human connection. In R. Earnshaw, R. Guedj, A. van Dam 
and J. Vince (Eds), Frontiers of human-centred computing, online communities and virtual environments 
(pp. 163-178). London: Springer-Verlag. 
Velichkovsky, B. M., Sprenger, A., & Pomplun, M. (1997). Auf dem Weg zur Blickmaus: Die Beeinflussung der 
Fixationsdauer durch kognitive und kommunikative Aufgaben. In R. Liskowsky, B. M. Velichkowsky, & W. 
Wünschmann (Eds), Software-Ergonomie (pp. 317-327). 
Vergo, J. (1998). A statistical approach to multimodal natural language interaction. In Proceedings of the 
AAAI’98 Workshop on Representations for Multimodal Human-Computer Interaction, Madison, 
Wisconsin, United States of America, 81-85. 
Vertanen, K. & MacKay, D.J.C. (2010). Speech Dasher: Fast writing using speech and gaze. In Proceedings of CHI 
2010, Atlanta, Georgia, United States of America, 595-598. 
Wachs, J.P., Kölsch, M., Stern, H. & Edan, Y. (2011). Vision-based hand-gesture applications. Communications of 
the ACM, 54(2), 60-71. 
Ward, D.J., Blackwell, A.F. & MacKay, D.J.C. (2000). Dasher – a data entry interface using continuous gestures 
and language models. In Proceedings of UIST 2000: The 13th Annual ACM Symposium on User Interface 
Software and Technology, San Diego, California, United States of America, 129-137. 
Ware, C. & Mikaelian, H.H. (1987). An evaluation of an eye tracker as a device for computer input. In 
Proceedings of CHI, 183-188. 
Whitley, E. & Ball, J. (2002). Statistics review 6: Nonparametric methods. Critical Care, 6, 509-513. 
Wixon, D. & Wilson, C. (1997). The usability engineering framework for product design and evaluation. In M.G. 
Helander (Ed.), Handbook of human-computer interaction (pp. 653-688). Holland: Elsevier. 
Wobbrock, J.O. (2007). Measures of text entry performance. In I.S. MacKenzie & K. Tanaka-Ishii (Eds), Text 
entry systems: Mobility, Accessibility, Universality (pp. 47-74). San Francisco: Morgan Kaufmann. 
Wobbrock, J.O., Rubinstein, J., Sawyer, M.W. & Duchowski, A.T. (2008). Longitudinal evaluation of discrete 
consecutive gaze gestures for text entry. In Proceedings of the Symposium on Eye Tracking Research and 
Applications (ETRA), Savannah, Georgia, United States of America, 11-18. 
World-English. (2003). Retrieved 14 July 2011 from http://www.world-english.org/english500.htm. 
wordiQ. (2010). Retrieved 16 August 2010 from http://www.wordiq.com. 
www.msu.edu. Retrieved 23 February 2011 from www.msu.edu. 
www.sci-info-pages.com. (nd). Retrieved 12 November 2010 from http://www.sci-info-pages.com. 
www.tobii.com. Tobii Eye-tracking Technology. Retrieved 2 February 2011 from www.tobii.com. 
Yale Medical Group. (nd). Accessed 23 January 2011 at www.yalemedicalgroup.org. 
Yankelovich, N. (2008). Using natural dialogs as the basis for speech interface design. In D. Gardner-Bonneau & 
H.E. Blanchard (Eds), Human factors and voice interactive systems (pp. 417-442). New York, NY: Springer 
Science+Business Media. 
Zhai, S., Morimoto, C. & Ihde, S. (1999). Manual And Gaze Input Cascaded (MAGIC) Pointing. In Proceedings of 
CHI ’99: ACM Conference on Human Factors in Computing Systems, Pittsburgh, Pennsylvania, United 
States of America, 246-253. 
Zhang, Q., Imamiya, A., Go, K. & Mao, X. (2004). Resolving ambiguities of a gaze and speech interface. In 
Proceedings of the Symposium on Eye Tracking Research and Applications (ETRA), San Antonio, Texas, 
United States of America, 85-92. 
Zhang, X. & MacKenzie, I.S. (2007). Evaluating eye tracking with ISO 9241 – Part 9. In J. Jacko (Ed.), Human 
Computer Interaction, 779-788. 
 
 
 
BIBLIOGRAPHY 
 
Andrews, S. (1991). Improving touchscreen keyboards: Design issues and a comparison with other devices. 
Interacting with computers, 3(3), 253-269. 
Barea, R., Boquete, L., Bergasa, L.M., López, E. & Mazo, M. (2003). Electro-oculography guidance of a 
wheelchair using eye movements codification. The International Journal of Robotics Research, 22, 641-
652. 
Berglund, A. & Qvarfordt, P. (2003). Error resolution strategies for interactive television speech interfaces. In 
Proceedings of Interact ’03, Zurich, Switzerland, 105-112. 
Cohen, M.H., Giangola, J.P. & Balogh, J. (2004). Voice User Interface Design. Boston: Addison-Wesley. 
Coleman, J. (2005). Introducing speech and language processing. Cambridge: University Press. 
Czerwinski, M., Smith, G., Regan, T., Meyers, B., Robertson, G. & Starkweather, G. (2003). Toward 
characterizing the productivity benefits of very large displays. In Proceedings of Interact ’03, Zurich, 
Switzerland, 9-16. 
Czerwinski, M., Robertson, G., Meyers, B., Smith, G., Robbins, D. & Tan, D. (2007). Large display research 
overview. In Proceedings of CHI ’06, Quebec, Canada, 69-74. 
Deng, L. & Huang, X. (2004). Challenges in adopting speech recognition. Communications of the ACM, 47(1), 
69-75. 
Drewes, H. (2006). Gaze tracking in HCI. In Proceedings of the First International Colloquium on Pervasive 
Computing. 
Farid, M., Murtagh, F. & Starck, J.L. (2002). Computer Display Control and Interaction Using Eye-Gaze. Journal 
of the Society for Information Display, 10(3), 289-293. 
Hatfield, F., Jenkins, E.A. & Jennings, M.W. (1996). Principles and Guidelines for the Design of Eye/Voice 
Interaction Dialogs. In Proceedings of the Third Annual Symposium on Human Interaction with Complex 
Systems, Dayton, Ohio, United States of America, 10-19. 
Holman, D. (2007). GazeTop: Interaction techniques for gaze-aware tabletops. In Proceedings of CHI 2007, San 
Jose, California, United States of America, 1657-1660. 
Hyrskykari, A., Majaranta, P. & Räihä, K-J. (2005). From gaze control to attentive interfaces. In Proceedings of 
Human-Computer Interaction International (HCII), Las Vegas, Nevada, United States of America. 
MacKenzie, I.S. (n.d.). ISO Testing of Computer Pointing Devices. Retrieved 1 February 2010 from 
http://www.yorku.ca/mack/. 
MacKenzie, I.S. (2003). Motor behaviour models for human-computer interaction. In JM Carroll (Ed.) Toward a 
multidisciplinary science of human-computer interaction (pp. 27-54). San Francisco: Morgan Kaufmann. 
Milekic, S. (2003). The more you look the more you get: Intention-based interface using gaze-tracking. 
Museums and the Web ’03.  
Miniotas, D., Špakov, O., Tugoy, I. & MacKenzie, I.S. (2005). Extending the limits for gaze pointing through the 
use of speech. Information and Control, 34, 225-230. 
Modlitba, P. (2004). Audiovisual attentive user interfaces – Attending to the needs and actions of the user. T-
121.900, Seminar on user interfaces and usability. 
Nielsen, J. (1996). International usability testing. Retrieved 7 June 2010 from http://www.useit.com/papers/ 
international_usetest.html. 
Optimoz Project. (nd). Mouse gestures. Retrieved from http://optimoz.mozdev.org/gestures/. 
Oulasvirta, A. & Salovaara, A. (2004). A cognitive meta-analysis of design approaches to interruptions in 
intelligent environments. In Proceedings of ACM Conference on Human Factors in Computing Systems, 
Vienna, Austria, 1155-1158. 
Pernice, K. and Nielsen, J. (2009). Eyetracking Methodology: How to Conduct and Evaluate Usability Studies 
Using Eyetracking. Alertbox, August, 2009. Retrieved 7 June 2010 from http://www.useit.com/ 
eyetracking/methodology/eyetracking-methodology.pdf. 
Rauterberg, M. (nd). The complete history of HCI. Retrieved 9 February 2009 from 
http://www.idemployee.id.tue.nl/g.w.m.rauterberg/ presentations/HCI-history_files/ frame.htm. 
Roberts, T.L. & Moran, T.P. (1983). The evaluation of text editors: Methodology and empirical results. 
Communications of the ACM, 26(4), 265-283. 
Rosson, M.B. (1984). Characterizing freeform editing behavior. IBM Research Report RC 10550, IBM T. J. 
Watson Research Center, Yorktown Heights, New York. 
Sasangohar, F., MacKenzie, I. S., & Scott, S. D. (2009). Evaluation of mouse and touch input for a tabletop 
display using Fitts’ reciprocal tapping task.  In Proceedings of the 53rd Annual Meeting of the Human 
Factors and Ergonomics Society – HFES 2009, San Antonio, Texas, United States of America, 839-843.  
Sears, A., Feng, J. & Oseitutu, K. (2003). Hands-free, speech-based navigation during dictation: Difficulties, 
consequences and solutions. Human-Computer Interaction, 18, 229-257. 
Sears, A., Karat, C-M., Oseitutu, K., Karimullah, A. & Feng, J. (2001). Productivity, satisfaction, and interaction 
strategies of individual with spinal cord injuries and traditional users interacting with speech recognition 
software. Universal Access in the Information Society, 1(4), 4-15. 
Selker, T. (2004). Visual Attentive Interfaces. BT Technology Journal, 22(4), 146-150. 
Soukoreff, R. W., & MacKenzie, I. S. (2003). Metrics for text entry research: An evaluation of MSD and KSPC, 
and a new unified error metric. In Proceedings of the ACM Conference on Human Factors in Computing 
Systems (CHI ’03), Fort Lauderdale, Florida, United States of America, 113-120. 
Sullivan, P. (1989). Human-computer interaction perspectives on word-processing issues. Computers and 
Composition, 6(3), 11-33. 
Tuisku, O., Majaranta, P., Isokoski, P. & Räihä, K-J. (2008). In Proceedings of the Symposium on Eye Tracking 
Research and Applications (ETRA), Savannah, Georgia, United States of America, 19-26. 
Vertegaal, R. (2002). Designing Attentive Interfaces. In Proceedings of the Symposium on Eye Tracking Research 
and Applications (ETRA), New Orleans, Louisiana, United States of America, 23-30. 
Vertegaal, R. (2003). Attentive user interfaces: Introduction. Communications of the ACM, 46(3), 31-22. 
Zajicek, A. & Morrissey, W. (2001). In A. Blandford, J. Vanderdonckt and P. Gray (Eds.), People and Computers 
XV – Interaction without frontiers, (pp. 503-558). London, Great Britain: Springer. 
 
  
 
 
APPENDIX A 
FEASIBILITY STUDY PRE-TEST QUESTIONNAIRE 
 
University of the Free State 
Department of Computer Science and Informatics
Multimodal word processors 
Pre-Test Questionnaire 
 
1. Name and surname:          
 
2. Age:    
 
3. Highest qualification:         
 
4. Which fields do you specialise in? 
  
  
 
5. Do you understand what is meant by usability? 
Yes No 
6. Do you understand what is meant by multimodal interfaces? 
Yes No 
7. Have you ever used speech recognition as a dictation tool? 
Yes No 
8. Have you ever used speech recognition to issue commands to a computer programme? 
Yes No 
9. Have you ever used eye tracking as a means to interact with an application? 
Yes No 
10. When working in a word processor do you make use of shortcut keys? 
Yes No 
11. Are you a touch typist? 
Yes No 
  
 
 
APPENDIX B 
EXPERT REVIEW TASK LIST 
 
Tasks 
 
1. Familiarise yourself with the new Word environment which makes use of the eye tracking, speech 
recognition and onscreen keyboard.  
a. Make sure that you are comfortable using your eye gaze and the verbal command “Go” to 
write a character to the document. 
b. Insert a few words of your choice into the document and try and increase your speed to a 
speed that is comparable with your normal typing speed. 
c. Change some of the options for interaction techniques, such as the length of the dwell time, 
using blinking and the enter key as well as the voice commands. 
d. Change other options such as the shape of your gaze indicator. 
e. Use the verbal commands in the table below to navigate through your document and make 
changes. Use them in combination to determine how they work together. 
  
 
 
APPENDIX C 
FEASIBILITY STUDY POST-TEST QUESTIONNAIRE 
 
Post-Test questionnaire 
1. When you first encountered the system, were you sceptical as to its practicality or were you excited? 
Sceptical Excited 
 
2. Please explain the reason for your answer in question 1. 
 
3. Did your interaction with the system change your mind? 
Yes No 
 
4. Please explain the reason for your answer in question 3. 
 
5. As a person involved in the IT field, do you think there is a need for multimodal interfaces? 
Yes No 
 
6. Do you think that the combination presented to you today is a viable option for a multimodal 
interface? 
Yes No 
 
7. Please explain the reason for your answer in question 6. 
 
8. As an initial impression, did you feel excited about the opportunities presented by the system? 
Yes No 
 
9. In terms of a more long term application, do you think the combination presented to you would be a 
solution to multimodal interfacing? 
Yes No 
 
10. From the point of view of a mainstream user, would you prefer that the multimodal options were 
available in a widely used package such as Microsoft Word? 
Yes No 
 
11. From the point of view of a mainstream user, do you think that the multimodal interface will assist in 
working more efficiently under varying working conditions? 
Yes No 
 
12. From the point of view of a mainstream user, do you like the idea that the multimodal interface will 
provide more flexibility and choice of input techniques? 
Yes No 
 
13. Disabled users who cannot make use of a keyboard and mouse are generally forced to make use of a 
specially designed application. As a consequence, they are not normally assimilated into the user 
group of the mainstream applications. Do you feel that the distinction is justified? 
Yes No 
 
14. Are you of the opinion that the solution offered in Microsoft Word is a possible solution for disabled 
users? 
Yes No 
 
15. As a first impression, did you find the use of eye-tracking as an interaction technique an exciting 
development? 
Yes No 
 
16. As a more long term application, do you think the use of eye tracking as an interaction technique will 
be beneficial? 
Yes No 
 
17. Rank the eye gaze interaction techniques in the order in which you enjoyed using them where 1 is the 
most enjoyable and 4 the least enjoyable. 
 Dwell time  Blinking  Enter key  Combined with voice commands 
 
18. Indicate which eye gaze interaction techniques you think are the most viable/usable for a more long 
term use. Rank them from 1 as the most viable to 4 as the least viable. 
 Dwell time  Blinking  Enter key  Combined with voice commands 
 
19. As a first impression, did you find the use of speech recognition for verbal commands an exciting 
development? 
Yes No 
 
20. As a more long term application, do you think the use of speech recognition for verbal commands will 
be beneficial? 
Yes No 
 
21. Did you enjoy using the verbal commands? 
Yes No 
 
22. Did the verbal commands allow you to navigate easier than what you normally do? 
Yes No 
 
23. If you answered no to question 20, do you think that with practice you will be more efficient with 
verbal commands than the way you normally do? 
Yes No 
 
24. As a first impression, did you find the use of eye-tracking and speech recognition together an exciting 
development? 
Yes No 
 
25. In terms of a long-term solution, do you think eye-tracking and speech recognition together will offer 
a usable working environment? 
Yes No 
 
26. Please provide suggestions for improvements or changes to the application. 
  
 
 
 
APPENDIX D 
POINTING DEVICE STUDY PRE-TEST QUESTIONNAIRE 
 
 
 
University of the Free State 
Department of Computer Science and Informatics  
 
Pre-test Questionnaire 
ON BEHALF OF THE UNIVERSITY OF THE FREE STATE AND THE DEPARTMENT OF COMPUTER SCIENCE WE WOULD LIKE TO THANK YOU FOR 
PARTICIPATING IN THIS RESEARCH PROJECT.   
 
WE OFFER OUR ASSURANCE THAT ALL INFORMATION CAPTURED AND/OR RECORDED HERE WILL ONLY BE USED FOR RESEARCH PURPOSES 
AND YOUR PARTICIPATION IS VOLUNTARY.  
 
PLEASE ANSWER THE FOLLOWING QUESTIONS. 
 
1. Subject unique identifier (will be provided by the 
facilitator):__________________________________ 
 
2. Age:  _________ 
 
3. Home Language:  __________________________________ 
 
4. For how many years have you been using a computer? 
 
 Never used a computer  Less than 1 year 
    
 1 – 3 years  3-5 Years 
    
 More than 5 years   
 
5. How often do you use a computer? 
 
 Daily  Weekly 
    
 Once every two weeks  Once a month 
    
 Less than once a month   
 
 
  
6. For how many years have you been using a computer mouse? 
 
 Never used a mouse  Less than 1 year 
    
 1 – 3 years  3-5 Years 
    
 More than 5 years   
 
7. How often do you use a computer mouse? 
 
 Daily  Weekly 
    
 Once every two weeks  Once a month 
    
 Less than once a month   
 
8. Have you ever used an eye tracker to work on a computer? 
 
 Yes  No 
 
If Yes, proceed to Question 9, else proceed to Question 11. 
 
9. Have you ever used an eye tracker as a pointing device (substitute for a mouse)? 
 
 Yes  No 
 
If Yes, for how long and how often do you use it? 
         
 
 
10. Specify in what capacity you have used an eye tracker. 
         
11. Have you ever used speech recognition to work on a computer? 
 
 Yes  No 
 
If Yes, proceed to Question 12, else the questionnaire is complete. 
 
12. Have you ever used speech recognition for cursor control? 
 
 Yes  No 
 
If Yes, for how long and how often do you use it? 
         
 
 
13. Specify in what capacity you have used speech recognition. 
        
 
 
 
APPENDIX E 
POINTING DEVICE ASSESSMENT QUESTIONNAIRE 
 
Device assessment 
Please circle the x that is most appropriate as an answer to the given comment. 
 
1. The force required for actuation (propelling or moving the device) was 
 x x x x x 
 too low    too high 
      
2. Smoothness during operation was 
 x x x x x 
 very rough    very smooth 
      
3. The mental effort required for operation was 
 x x x x x 
 too low    too high 
      
4. The physical effort required for operation was 
 x x x x x 
 too low    too high 
      
5. Accurate pointing was 
 x x x x x 
 easy    difficult 
      
6. Operation speed was 
 x x x x x 
 too fast    too slow 
      
7. Neck fatigue 
 x x x x x 
 none    very high 
      
8. General comfort: 
 x x x x x 
 very uncomfortable    very comfortable

9. Overall, the input device was 
 x x x x x 
 very difficult to use    very easy to use
      
10. Please indicate which of the following you preferred by circling your preferred method. 
 
Large Buttons     Small Buttons 
 
 
11. Please indicate which of the following you preferred by circling your preferred method. 
 
Framed button     Inverted colour button 
 
 
12. Do you think that you will eventually be able to achieve the same speeds with eye gaze and speech 
recognition as with a mouse?    Y  / N 
 
13. Did you enjoy working with the speech recognition and eye gaze as a pointing device? Y     / N 
 
14. When using the mouse, did the magnification tool assist you to work more accurately? Y    / N 
 
Comments:             
 
15. When using eye gaze and speech recognition, did the magnification tool assist you to work more 
accurately?  Y / N 
 
Comments:             
 
 
Any other comments and suggestions: 
            
            
            
 
  
 
 
APPENDIX F 
USER TESTING PRE-TEST QUESTIONNAIRE 
 
 
University of the Free State 
Department of Computer Science and Informatics  
 
Pre-test Questionnaire 
ON BEHALF OF THE UNIVERSITY OF THE FREE STATE AND THE DEPARTMENT OF COMPUTER SCIENCE WE WOULD LIKE TO 
THANK YOU FOR PARTICIPATING IN THIS RESEARCH PROJECT.   
 
WE OFFER OUR ASSURANCE THAT ALL INFORMATION CAPTURED AND/OR RECORDED HERE WILL ONLY BE USED FOR 
RESEARCH PURPOSES AND YOUR PARTICIPATION IS VOLUNTARY.  
 
PLEASE ANSWER THE FOLLOWING QUESTIONS. 
 
 
1. Student Number:  
___________________________________________ 
 
2. Age:    
 
3. Gender:  Male / Female 
 
4. Current field of study:        
 
5. For how many years have you been using a word processor? 
 
 Never used a word processor  Less than 1 year 
    
 1 – 3 years  3-5 Years 
    
 More than 5 years   
 
6. How often do you use a word processor? 
 
 Daily  Weekly 
    
 Once every two weeks  Once a month 
    
 Less than once a month   
 
7.  
8. Do you ever use keyboard shortcuts in a word processor? 
 
 Yes  No 
 
9. Do you prefer using a mouse or the keyboard to complete tasks in a word processor? 
 
 Mouse  Keyboard 
 
10. Have you ever used an eye tracker to work on a computer? 
 
 Yes  No 
 
If Yes, proceed to Question 10, else proceed to Question 12. 
 
11. Have you ever used an eye tracker as a pointing device (substitute for a mouse)? 
 
 Yes  No 
 
If Yes, for how long and how often do you use it? 
         
 
12. Specify in what capacity you have used an eye tracker. 
         
 
 
13. Have you ever used speech recognition to work on a computer? 
 
 Yes  No 
 
If Yes, proceed to Question 13, else the questionnaire is complete. 
 
14. Have you ever used speech recognition for cursor control? 
 
 Yes  No 
 
If Yes, for how long and how often do you use it? 
         
 
 
15. Specify in what capacity you have used speech recognition. 
         
 
 
 
 
 
APPENDIX G 
POST- TEST QUESTIONNAIRE – FIRST SESSION 
 
Adapted from Shneiderman (1998). Designing the User Interface. p 136 – 143. 
 
PART 3: Overall User Reactions 
 
3.1 Overall reaction to the system: Terrible  Wonderful 
 
 1 2 3 4 5 
      
3.2 Frustrating  Satisfying 
 1 2 3 4 5 
      
3.3 Dull  Stimulating 
 1 2 3 4 5 
      
3.4 Difficult  Easy 
 1 2 3 4 5 
      
3.5 Inadequate  Adequate 
 1 2 3 4 5 
      
3.6 Rigid  Flexible 
 1 2 3 4 5 
 
 
PART 6: Learning 
 
6.1  Learning to operate the system Difficult  Easy  
 
 1 2 3 4 5 
6.1.1 Getting started Difficult  Easy 
 1 2 3 4 5 
6.1.2  Learning advanced features Difficult  Easy 
 1 2 3 4 5 
6.1.3  Time to learn to use the system Slow  Fast 
 1 2 3 4 5 
 
 
PART 7: System capabilities 
      
7.4 Correcting your mistakes Difficult  Easy 
      
7.5  Ease of operation depends on your level of experience Never  Always
 1 2 3 4 5 
7.5.1  You can accomplish tasks knowing only a few commands With difficulty  Easily
 1 2 3 4 5 
7.5.2  You can use features/shortcuts With difficulty  Easily
 1 2 3 4 5 
 
Device assessment 
Please circle the x that is most appropriate as an answer to the given comment. 
 
1. The force required for actuation (propelling or moving the device) was 
 x x x x x 
 too low    too high 
      
2. Smoothness during operation was 
 x x x x x 
 very rough    very smooth 
      
3. The mental effort required for operation was 
 x x x x x 
 too low    too high 
      
4. The physical effort required for operation was 
 x x x x x 
 too low    too high 
      
5. Accurate pointing was 
 x x x x x 
 easy    difficult 
      
6. Operation speed was 
 x x x x x 
 too fast    too slow 
      
7. Neck fatigue 
 x x x x x 
 none    very high 
      
8. General comfort: 
 x x x x x 
 very uncomfortable    very comfortable

9. Overall, the input device was 
 x x x x x 
 very difficult to use    very easy to use
 
10. Do you think that you will eventually be able to achieve the same speeds with eye gaze and speech 
recognition as with a mouse and keyboard?  Y  / N 
 
11. Did you enjoy working with the speech recognition and eye gaze as a pointing device?  Y     / N 
 
12. Do you think the added features in Word are useful?  Y / N 
 
13. Do you think the added features in Word make the Word application better?   
Y / N 
 
14. Do you think that the added features in Word will ever gain mainstream use?   
Y / N 
 
Give reasons for your answer: 
            
             
 
 
15. Do you think that there can be a market for the added features in Word?  Y / N 
 
Give reasons for your answer: 
            
            
             
 
 
16. Would you consider adopting the added features as a standard means of using Word for yourself? 
Y / N 
 
 
Any other comments and suggestions: 
            
            
            
            
            
 
  
 
 
APPENDIX H 
POST-TEST QUESTIONNAIRE – LAST SESSION 
 
 
 
University of the Free State 
Department of Computer Science and Informatics  
 
Post-test Questionnaire 
 
PLEASE ANSWER THE FOLLOWING QUESTIONS. 
 
 
Student Number:  
___________________________________________ 
 
Do you have corrected vision (glasses / contact lenses)? If so, please indicate which: 
 
Y  / N 
 
Glasses  /  Contact lenses 
 
 
1. Do you think that you will eventually be able to achieve the same speeds with eye gaze and speech 
recognition as with a mouse and keyboard?   Y  / N 
 
2. Did you enjoy working with the speech recognition and eye gaze as a pointing device? Y   / N 
 
3. Do you think the added features in Word are useful?   Y / N 
 
4. Do you think the added features in Word make the Word application better?  Y / N 
 
5. Do you think that the added features in Word will ever gain mainstream use?  Y / N 
Give reasons for your answer: 
            
             
 
6. Do you think that there can be a market for the added features in Word?  Y / N 
Give reasons for your answer: 
            
             
 
 
7. Would you consider adopting the added features as a standard means of using Word for yourself? 
Y / N 
  
 
Give reasons for your answer: 
            
             
 
PART 3: Overall User Reactions – Complete system 
 
Answer the following questions based on your experience with the system as a whole. 
 
3.1 Overall reaction to the system: Terrible  Wonderful 
 
 1 2 3 4 5 
      
3.2 Frustrating  Satisfying 
 1 2 3 4 5 
      
3.3 Dull  Stimulating 
 1 2 3 4 5 
      
3.4 Difficult  Easy 
 1 2 3 4 5 
      
3.5 Inadequate  Adequate 
 1 2 3 4 5 
      
3.6 Rigid  Flexible 
 1 2 3 4 5 
 
 
PART 6: Learning 
 
6.1  Learning to operate the system Difficult  Easy 
 
 1 2 3 4 5 
6.1.1 Getting started Difficult  Easy 
 1 2 3 4 5 
6.1.2  Learning advanced features Difficult  Easy 
 1 2 3 4 5 
6.1.3  Time to learn to use the system Slow  Fast 
 1 2 3 4 5 
 
 
Any other comments and suggestions: 
            
            
            
         
 
  
 
PART 3: Overall User Reactions - Typing 
 
Answer the following questions based on your experience of typing with the eye tracking and speech 
recognition. 
 
 
3.1 Overall reaction: Terrible  Wonderful 
 
 1 2 3 4 5 
      
3.2 Frustrating  Satisfying 
 1 2 3 4 5 
      
3.3 Dull  Stimulating 
 1 2 3 4 5 
      
3.4 Difficult  Easy 
 1 2 3 4 5 
      
3.5 Inadequate  Adequate 
 1 2 3 4 5 
      
3.6 Rigid  Flexible 
 1 2 3 4 5 
 
 
PART 6: Learning 
 
6.1  Learning to type Difficult  Easy 
 
 1 2 3 4 5 
6.1.1 Getting started Difficult  Easy 
 1 2 3 4 5 
6.1.2  Learning advanced features Difficult  Easy 
 1 2 3 4 5 
6.1.3  Time to learn to use the system Slow  Fast 
 1 2 3 4 5 
      
7.  How natural did it feel to type using eye gaze and speech Unnatural  Natural 
 1 2 3 4 5 
 
Any other comments and suggestions: 
            
             
 
  
 
PART 3: Overall User Reactions - Commands 
 
Answer the following questions based on your experience of issuing commands to the system for 
formatting and cursor movement. 
 
 
3.1 Overall reaction: Terrible  Wonderful 
 
 1 2 3 4 5 
      
3.2 Frustrating  Satisfying 
 1 2 3 4 5 
      
3.3 Dull  Stimulating 
 1 2 3 4 5 
      
3.4 Difficult  Easy 
 1 2 3 4 5 
      
3.5 Inadequate  Adequate 
 1 2 3 4 5 
      
3.6 Rigid  Flexible 
 1 2 3 4 5 
 
 
PART 6: Learning 
 
6.1  Learning to issue commands Difficult  Easy 
 
 1 2 3 4 5 
6.1.1 Getting started Difficult  Easy 
 1 2 3 4 5 
6.1.2  Learning advanced features  Difficult  Easy 
 1 2 3 4 5 
6.1.3  Time to learn to use the system Slow  Fast 
 1 2 3 4 5 
6.1.4  Time to learn to string commands together Slow  Fast 
 1 2 3 4 5 
      
7.  How natural did it feel to issue commands Unnatural  Natural 
 1 2 3 4 5 
      
 
8. For each of the following command types, indicate how easy it was to use them: Difficult   Easy 
Moving the cursor 1 2 3 4 5 
Formatting text (e.g. bold, italic) 1 2 3 4 5 
Selecting text (e.g. line or word) 1 2 3 4 5 
Cutting/copying and pasting 1 2 3 4 5 
 
Any other comments and suggestions: 
            
             
 
  
 
15. Did you feel your typing improved over the time period in which you used the system? 
 
 Yes  No 
 
Comments: 
            
             
 
16. Did you feel more at ease with issuing commands as you became accustomed to the system? 
 
 Yes  No 
 
Comments: 
             
 
17. Would you consider using a system like this for typing purposes? 
 
 Yes  No 
 
Give a reason for your answer: 
             
 
18. Would you consider using a system like this for issuing commands? 
 
 Yes  No 
 
Give a reason for your answer: 
             
 
19. Did the audio feedback when typing assist in the typing process? 
 
 Yes  No 
 
Give a reason for your answer: 
             
 
20. Would you have preferred having visual feedback during the typing task (e.g. button changes colour)? 
 
 Yes  No 
 
Give a reason for your answer and any other suggestions for feedback you may have: 
             
 
For the typing tasks, there were five sentences you had to type right at the end of the test. The first sentence 
used large buttons, the second and third sentence smaller buttons which were spaced further apart and the 
fourth and fifth sentences used smaller buttons which were spaced closer together. 
 
 
21. Rank the buttons in order of your preference where 1 is the most liked and 3 the least liked: 
 
 Large buttons  Smaller buttons, far apart  Smaller buttons, closer together
 
Comments: 
             
 
 
22. Rank the buttons in the order in which they were easiest to use where 1 is the easiest and 3 the most 
difficult: 
 
 Large buttons  Smaller buttons, far apart  Smaller buttons, closer together
 
Comments: 
             
 
Device assessment 
Please circle the x that is most appropriate as an answer to the given comment. 
 
Answer the following questions regarding using eye gaze and speech recognition for typing: 
 
1. The force required for actuation (propelling or moving the device) was 
 x x x x x 
 too low    too high 
      
2. Smoothness during operation was 
 x x x x x 
 very rough    very smooth 
      
3. The mental effort required for operation was 
 x x x x x 
 too low    too high 
      
4. The physical effort required for operation was 
 x x x x x 
 too low    too high 
      
5. Accurate pointing was 
 x x x x x 
 easy    difficult 
      
6. Operation speed was 
 x x x x x 
 too fast    too slow 
      
10. Neck fatigue 
 x x x x x 
 none    very high 
 
      
11. General comfort: 
 x x x x x 
 very uncomfortable    very comfortable

12. Overall, the input device was 
 x x x x x 
 very difficult to use    very easy to use
 
 
Any other comments and suggestions: 
            
            
             
  
 
 
APPENDIX I 
PUBLICATIONS 
 
To date, there have been four publications stemming from the research study discussed in the thesis. These 
publications are as follows: 
Appendix I-1: Abstract was originally published in Afrikaans 
Beelders, T.R. and Blignaut, P.J. (2009). A multi-modal interface for a popular word processor. Die Suid-
Afrikaanse Akademie vir Wetenskap en Kuns Studentesimposium 2009, Bloemfontein, South Africa. 
 
 
Appendix I-2 
Beelders, T.R. and Blignaut, P.J. (2010). Using vision and voice to create a multimodal interface for Microsoft 
Word 2007. Proceedings of the Symposium on Eye-Tracking Research and Applications (ETRA), Austin, Texas, 
United States of America, 173-176. 
 
 
Appendix I-3: Abstract was originally published in Afrikaans 
Beelders, T.R., Blignaut, P.J. and Greeff, F. (2010). Eye-tracking and speech recognition instead of a computer 
mouse. Die Suid-Afrikaanse Akademie vir Wetenskap en Kuns Studentesimposium 2010, Pretoria, South Africa. 
 
 
Appendix I-4: 
Beelders, T.R. and Blignaut, P.J. (2011). The Usability of Speech and Eye Gaze as a Multimodal Interface for a 
Word Processor. In I. Ipšić (Ed), Speech Technologies (pp. 385-404). ISBN: 978-953-307-996-7. 
  
A multimodal interface for a popular word processing package 
(English translation of "ŉ Multimodale koppelvlak vir ŉ gewilde woordverwerkingspakket") 
 
A word processor is a popular computer program that is used regularly by a diverse group of users. A single 
program must therefore cater for a wide variety of users, each with their own needs and preferences for 
interaction. Furthermore, disabled users are usually limited in their choices because only certain programs 
take their limitations into account. In general, programs used by disabled users are not used by mainstream 
users, but are written specifically for disabled users. In addition, the trend is to move away from standard 
interfaces with menus and icons that are manipulated with the mouse. The focus of this study is to build 
non-traditional interaction techniques into a word processor and then to determine whether this offers a fully 
fledged solution for ensuring accessibility for all users. 
There are many word processing programs available on the market, of which Microsoft Word is the most 
popular. This study therefore uses Microsoft Word 2007 as a base in which additional interaction techniques 
are embedded to provide a multimodal interface. One of the new interaction techniques allows an eye tracker 
to be used to edit a document. For this purpose, a mouse click can be simulated by the user's eyes in several 
ways. The second new interaction technique embedded in the interface provides for the use of speech 
recognition to dictate text as well as to speak editing commands aloud. 
The two interaction techniques can also be combined so that the context of a verbal instruction is determined 
by the item being looked at. For example, the user can look at the “Bold” icon in the toolbar and then say 
“click” aloud. Furthermore, an image of a keyboard is displayed at the bottom of the screen and the user can 
type a document in Microsoft Word simply by looking at the respective keys on the keyboard. The mouse 
pointer follows the user's gaze, and the immediate area under the mouse pointer can also be magnified to 
make it easier for users with poor eyesight to work with the interface. 
The study aims to compare the different interaction techniques with one another to determine which 
combination of techniques is the most usable. An expert analysis has already been done to evaluate the 
long-term viability of such an interface and to obtain first impressions of the interaction techniques as they 
can be used in Word 2007. 
The next step is to determine whether the new interaction techniques increase productivity and whether users 
can learn to use the techniques to meet their particular circumstances and needs. To do this, test users will be 
asked to carry out representative tasks using all of the possible interaction techniques. The time users take 
and the correctness with which tasks are performed will be compared to determine whether the altered 
interface allows users to attain at least the same level of productivity as can be achieved with a standard 
interface. 
  
249 
 
 Publications 
Using Vision and Voice to Create a Multimodal Interface  
for Microsoft Word 2007 
Abstract

There has recently been a call to move away from the standard WIMP type of interfaces and give users access
to more intuitive interaction techniques. Therefore, in order to test the usability of a multimodal interface
in Word 2007, the most popular word processor, the additional modalities of eye gaze and speech recognition
were added within Word 2007 as interaction techniques. This paper discusses the developed application and the
way in which the interaction techniques are included within the well-established environment of Word 2007.
The additional interaction techniques are fully customizable and can be used in isolation or in combination.
Eye gaze can be used with dwell time, look and shoot or blinking and speech recognition can be used for
dictation and verbal commands for both formatting purposes and navigation through a document. Additionally,
the look and shoot method can also be combined with a verbal command to facilitate a completely hands-free
interaction. Magnification of the interface is also provided to improve accuracy and multiple onscreen
keyboards are provided to provide hands-free typing capabilities.

Keywords: Eye-tracking, speech recognition, usability, word processing, multimodal

Introduction

The word processor has become a very popular tool in the everyday use of a computer [Roberts and Moran, 1983]
and by 1984, 80-100% of users’ time on a computer was spent using a word processor or other editor-based
application [Rosson, 1984]. The word processor application has evolved substantially since its initial
inception and since then has undergone a virtual metamorphosis to achieve the capabilities that are available
in these applications today. As an integral part of everyday life for many people it caters for a very diverse
group of users, therefore, it is highly unlikely that only one such complex application would be able to offer
the best possible experience to all users [Sullivan, 1989]. Furthermore, users with disabilities or needs
other than those of mainstream users are not always taken into consideration during system development and
often have to compensate by using specially designed applications which do not necessarily compare with the
more popular applications. This study therefore aims to investigate various means to increase the usability of
a word processor for as wide a user group as possible.

For this reason, the interface of the most popular word processor application will be extended into a
multimodal interface. This interface should facilitate use of the mainstream product by marginalized users,
whilst at the same time enhancing the user experience for novice, intermediate and expert users. Ideally the
interface should be customizable and allow users to select any combination of interaction techniques which
suit their needs. The premise of the research study is not to develop a new word processor but rather to
incorporate additional interaction techniques, besides the keyboard and mouse, into an application which has
already been accepted by the user community. This will allow for the improvement of an already popular product
and stimulate inclusiveness of non-mainstream users into the mainstream market.

The research study is still in the beginning phase where development of the tool is underway. Therefore, for
the purposes of this paper, the application as it has been developed will be the main focus. The paper will,
however, conclude with a short discussion of the next phases of the research study.

Interaction Techniques

Using a physical input device in order to communicate or perform a task in human-computer dialogue is called
an interaction technique [Foley, et al., 1990 as cited in Jacob, 1995]. The interaction techniques of speech
recognition and eye tracking will be included in a popular word processor interface to create a multimodal
interface as a means to determine whether the usability of this product can be enhanced in this way.

Although this approach has received limited attention thus far, the multimodal approach has always focused on
the development of a third-party application, for example EyeTalk [Hatfield and Jenkins, 1997]. Contrary to
this, this study will use an already existing application, namely Microsoft Word ©, which currently enjoys a
high prevalence in the commercial market.

Development environment

The development environment used was Visual Studio 2008, making use of the .NET Framework 3.5. Visual Studio
Tools for Microsoft Office System 2008 (VSTO) in C# was used for development. VSTO allows programmers to use
managed code to build Office-based solutions in C# and VB.NET [Anderson, 2009]. In order to incorporate the
speech recognition the Microsoft Speech Application Programming Interface (SAPI) with version 5.1 of the SDK
was used. The SDK provides the capability of compiling customized grammars and accessing the functionalities
of the speech recognizer. In order to provide gaze interaction Tobii SDK 1.5.4 was used. For magnification
purposes, which will be discussed in an upcoming section, the commercial product Magnifying Glass Pro 1.7 was
chosen as a relatively inexpensive solution but primarily based on the fact that it was one of the few
applications which incorporated clickable areas within the magnified area which are then correctly transferred
to the underlying area. This is essential in the developed product as the magnification will increase the
accuracy of cursor positioning via eye gaze and correct interpretation of user intention and requiring the
user to disable magnification before clicking on the interface would negate all the advantages gained from
magnification.

The aim of the development process was to incorporate speech recognition and eye tracking as additional
interaction techniques in the Microsoft Word environment. The user should also be given the freedom to
determine in which combination the interaction techniques must be used, while still having the option of
continued use of the traditional interaction techniques. As illustrated in Figure 1, an extra tab was added to
the established Microsoft Word ribbon. This tab (circled in red) was named Multimodal Add-Ins.
The new tab provides numerous options to the user to select which additional interaction techniques they would
like to use (Figure 1). As is evident from Figure 1, complete customization of the techniques is allowed via
selection of any combination of techniques as well as in what capacity the techniques must be implemented.

Figure 1: Multimodal Add-ins for Word 2007

Additional tools which are available to enhance the user experience are a magnification tool and an onscreen
keyboard which can be displayed at the bottom of the Word document. The magnification tool magnifies the
immediate area under the mouse cursor, thereby providing increased accuracy for users with weak eyesight and
those making use of the gaze sensitive interface. Magnification is available when using the mouse or when
using eye gaze as an interaction technique. The use of the magnification tool is entirely at the discretion of
the user who is capable of turning magnification on and off at will or as needed.

Onscreen keyboards are available as an alternative to using a traditional keyboard. The onscreen keyboard can
be used either through use of the traditional mouse or to achieve hands-free typing using eye gaze or a
combination of eye gaze and speech recognition. The final adapted interface, as envisioned in use when the
onscreen keyboard is in use, is shown in Figure 2.

The layout of the onscreen keyboard can be changed to either a traditional QWERTY keyboard layout or to an
alphabetic layout. Each keyboard contains all 26 alphabetic letters, a Space bar, Backspace and Delete keys as
well as special keys which simplify movement through the document. Special keys which are provided are Page
up, Page down, Home and End. The user can also toggle between upper case and lower case by activating and
deactivating the CAPS lock key. A Select All key is provided as a means for the user to select all the text in
the document. The two red arrows in the lower left corner of the keyboard (Figure 2) change the size of all
keyboard keys in decrements and increments of 10 pixels respectively, thereby providing even more
customization of the keyboard for the user. Auditory feedback in the form of a soft beep is given when a
keyboard key is clicked on.

Speech recognition

The user has the option of enabling the speech engine so that Microsoft Word can respond to verbal utterances.
In terms of the customizable options, the user can toggle between dictation mode and command mode. In
dictation mode, the speech recognition is implemented in the well-known method of capturing vocalizations,
translating those vocalizations into text and writing the result to the currently activated document in
Microsoft Word. In order for the dictation mode to be effective the user must select a previously trained
profile. A unique profile can be trained through the Windows Speech wizard. All the available speech profiles
are provided in a drop-down box on the multimodal add-in tab for the convenience of the user.

In command mode, a grammar is activated which accepts only isolated commands and responds to these in a
pre-determined manner. Command mode provides the functions of cursor control, formatting capabilities and
certain document handling capabilities. Several different commands are provided which have the same
application reaction, thereby contributing to further customization for the user as they can determine which
the most desirable command is for them to use. Moreover, simple cursor control is provided by providing
directional commands but more complex cursor control is also provided by allowing line selection and movement
of the cursor as though control keys (such as Shift) are being pressed in combination with the verbal command.
These types of commands will simplify selection of text and provide verbal commands for complex key
combinations which are not always known to novice and intermediate users. For example, the word “Bold” causes
the activation or deactivation of the bold formatting style. Similarly the words “Italic” and “Underline”
activate or deactivate their formatting style. Words such as “Cut”, “Copy” and “Paste” allow for text
manipulation and their subsequent actions are of course the cutting or copying of the currently selected text
and the pasting of the clipboard contents at the position of the cursor. More complex commands for text
selection are available such as “Select line”, which selects the whole line on which the cursor is situated,
“Select word”, which selects the word nearest to the right of the current cursor position. Cursor control is
achieved through the commands “Left”, “Right”, “Up” and “Down”. Verbal commands can be issued in sequence to
perform relatively complex document manipulation.
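
The command-mode behaviour described above, in which a grammar accepts only isolated words and maps each one
to a fixed reaction, can be illustrated with a short sketch. This is not the add-in's C# and SAPI code: it is
a minimal Python illustration, and the `Document` class, the `COMMANDS` table and `handle_utterance` are
hypothetical names introduced here. Only a handful of the commands named in the text are modelled.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Toy stand-in for the active document (hypothetical, not the Word object model)."""
    bold: bool = False
    italic: bool = False
    underline: bool = False
    cursor: int = 0  # character offset of the cursor


def toggle(style):
    """Build an action that switches one formatting style on or off."""
    def action(doc):
        setattr(doc, style, not getattr(doc, style))
    return action


def move(delta):
    """Build an action that moves the cursor by delta characters, never below 0."""
    def action(doc):
        doc.cursor = max(0, doc.cursor + delta)
    return action


# The command-mode "grammar": only these isolated words are accepted.
COMMANDS = {
    "bold": toggle("bold"),
    "italic": toggle("italic"),
    "underline": toggle("underline"),
    "left": move(-1),
    "right": move(+1),
}


def handle_utterance(doc, utterance):
    """Run the action for one recognised word; return False if it is not in the grammar."""
    action = COMMANDS.get(utterance.strip().lower())
    if action is None:
        return False
    action(doc)
    return True


# Commands issued in sequence: "Bold" is spoken twice, so bold ends up switched off again.
doc = Document()
for word in ["Bold", "Right", "Right", "Italic", "Bold"]:
    handle_utterance(doc, word)
```

Keeping the vocabulary in a single table mirrors the idea that several spoken words may share one reaction:
a synonym is simply a second key that points at the same action.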
 
 
 
 
Figure 2: Adapted interface of Word 2007 when the onscreen keyboard is activated

Eye-tracking

The eye tracker can be calibrated for use directly through the Microsoft Word interface. This increases the
usability of the application as the user is not required to move between applications to achieve their goal
of using gaze as an interaction technique. Since the word processor is the focus of this study, this meets the
requirement of the research question scope. The user has the option to activate eye gaze which can then be
used to position the cursor in the document or over an object to be manipulated. Customization is provided by
allowing the user to choose the activation method. For use purely as a single interaction method, the choices
of dwell time, look and shoot and blinking are provided. When dwell time is selected, the user is able to set
the interval of the dwell time (see the Sensitivity Setting text box on Figure 1). This provides additional
customization as the user can determine the speed with which they are most comfortable and leaves the option
for adjusting this interval as the user gains more confidence and experience with gaze based interaction. The
interval can be changed at any time during the use of the application. Dwell time requires the user to fixate
on a position for the set time of the dwell time interval before a left mouse click is simulated. When
selecting the look and shoot method, the user can position the cursor using eye gaze and then press the Enter
key to simulate a left mouse click. This would have the effect of either placing the cursor at the position of
the eye gaze or clicking on the icon directly under the eye gaze of the user. The third option available to
the user is that of blinking. In this scenario, the user fixates on the desired object or position and then
blinks their eyes to simulate a left mouse click.

Multiple interaction techniques

When the user selects No activation (Figure 1) via eye gaze that implies that they will instead be using voice
commands to respond to the current eye gaze position of the user. In this instance, the speech recognition
must also be enabled and the user can then issue verbal commands to move the cursor to the current gaze
position which is analogous to executing a left mouse click at that position. In this way, it is possible for
the user to place the cursor at any position in the document, or to click one of the Microsoft Word icons on
the ribbon. The verbal commands of “Go”, “Click” or “Select” all simulate a left mouse click at the button
closest to the current gaze position. In this way, the user is free to choose the command which they find most
suitable for them.

In most instances it is envisioned that the onscreen keyboard will also be activated under these
circumstances. When the onscreen keyboard is activated in conjunction with the eye gaze, visual feedback is
given to the user to indicate which button will be clicked when the verbal command is issued. With each
fixation that is detected within the boundaries of the keyboard, the button which is closest to that fixation
is determined to be the target and a shape is displayed in the centre of the button. The user can also select
which shape they would like to use for visual feedback. The available shapes are a solid square, a hollow
square, a solid disk and a hollow circle. The hollow shapes do not obscure the letter of the key and in so
doing provide the necessary visual feedback whilst still allowing the user to see the letter which will be
written to the document. Feedback is only given on the keyboard to minimize interference during normal
document browsing. In order to achieve increased stabilization of the feedback within a targeted object, the
algorithm as suggested by Kumar (2007) was used.

If the user is satisfied that the correct button has been determined they can then issue any of the verbal
commands to simulate a left mouse click. The letter shown on the keyboard is then written to the document at
the current cursor position.
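
The targeting rule described above, in which the key closest to the current fixation becomes the target of a
spoken “Go”, “Click” or “Select”, can be sketched as follows. This is a minimal Python illustration and not
the add-in's actual code: the `Key` layout, the distance-to-centre rule and the function names are
assumptions made for the example, and the stabilisation algorithm of Kumar (2007) is not modelled.

```python
import math
from typing import NamedTuple, Optional, Sequence


class Key(NamedTuple):
    label: str
    x: float  # left edge in pixels
    y: float  # top edge in pixels
    w: float  # width in pixels
    h: float  # height in pixels

    def centre(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


# All three spoken words simulate the same left mouse click.
CLICK_WORDS = {"go", "click", "select"}


def target_key(keys: Sequence[Key], gx: float, gy: float) -> Optional[Key]:
    """Return the key whose centre is closest to the fixation, or None outside the keyboard."""
    left = min(k.x for k in keys)
    top = min(k.y for k in keys)
    right = max(k.x + k.w for k in keys)
    bottom = max(k.y + k.h for k in keys)
    if not (left <= gx <= right and top <= gy <= bottom):
        return None  # feedback is only given within the keyboard boundaries
    return min(keys, key=lambda k: math.dist(k.centre(), (gx, gy)))


def on_command(keys, gx, gy, word):
    """Return the letter to type when a click word is spoken over the keyboard, else None."""
    if word.lower() not in CLICK_WORDS:
        return None
    key = target_key(keys, gx, gy)
    return key.label if key else None


# A hypothetical three-key row, each key 40 x 40 pixels.
KEYS = [Key("Q", 0, 0, 40, 40), Key("W", 40, 0, 40, 40), Key("E", 80, 0, 40, 40)]
```

In this sketch `target_key` would also drive the visual feedback, since the shape is drawn in the centre of
whichever key it returns.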
Where to next?

As previously mentioned, the research study is still in the preliminary stages of an empirical study. An
application has been developed to investigate the effect of multimodal interaction techniques on the usability
of a mainstream word processor application. Further enhancements to the application will include the expansion
of the keyboards to include numerical keys and the magnification will be refined to respond to eye gaze and
voice commands. More voice commands will be provided for, particularly for commands that currently have
shortcut keys assigned to them, such as Save and displaying certain dialog boxes.

Additionally, a back-end will be written for the application which will capture certain measurements which can
be used for usability analysis. Measurements such as the number of errors made during a task, the number of
actions required and the percentage of the task completed correctly will automatically be saved to a database
for further analysis.

Once the application has been completed, user testing will commence. Both disabled and non-disabled users of a
local university will be approached to participate in the study. A longitudinal study will be conducted
whereby the participants will be required to spend periods interacting with the system. After each exposure to
the system, users will be required to complete a number of tasks for which measurements will be captured. In
this way, the learnability of the study can be measured over a period of time by comparing the results of
these sessions to determine if user performance increases in correlation to user exposure to the application.
Since it is expected that there will be a learning curve associated with the application, it is deemed more
applicable to capture usability measurements over a period of time rather than only after a single session
with the application. In order to determine whether the application succeeds in providing for disabled users
whilst simultaneously providing for a better user experience for mainstream users, it is imperative that users
from both these demographics be included in the sample. Furthermore, to further investigate the usability of
the newly developed application, user efficiency and effectiveness can be measured in a within-subjects
experiment by requiring users to complete identical tasks in both the commercial Microsoft Word and the new
multimodal Microsoft Word.

Moreover, the usability of the various interaction techniques will also be analyzed to determine which
combination of the interaction techniques provides the most usable interface – if any. User satisfaction will
be measured through means of a questionnaire in order to gauge user reaction, both in a short-term and
long-term exposure period.

Summary

A multimodal interface was developed for Microsoft Word in order to eventually determine whether the usability
of this application can be enhanced for mainstream users whilst simultaneously providing an adaptable and
usable interface for disabled users. For these purposes, eye tracking and speech recognition capabilities were
built into the Word interface. These interaction techniques can be used in isolation or in combination and the
way in which they are used can be customized in a number of ways. Once the development has been completed and
measurements can be captured automatically in the background during user interaction, a longitudinal usability
study will be undertaken. Both disabled and able-bodied users will be included in the sample and will be
required to complete a number of practice sessions with the application over a prolonged period of time. After
each session, participants will be required to complete a number of tasks, during which measurements will be
captured for further analysis. In this way, it will be possible to determine whether users are able to improve
their performance on the system over an extended period – in other words, whether the system is usable.
Additionally, user performance between the new application and the commercially available application will be
compared to determine whether they can achieve comparable performance on both the systems. In this way, it
will be possible to determine whether a popular commercial application can be fully extended into a worthwhile
multimodal application which caters for a diverse group of users comprised of both disabled and able-bodied
users.

References

ANDERSON, T. (2009). Pro Office 2007 development with VSTO. APress: United States of America.

HATFIELD, F. AND JENKINS, E.A. (1997). An interface integrating eye gaze and voice recognition for hands-free
computer access. In Proceedings of the CSUN 1997 Conference.

JACOB, R.J. (1995). Eye tracking in advanced interface design. In Virtual Environments and Advanced Interface
Design, W. Barfield and T. A. Furness, Eds. Oxford University Press, New York, NY, 258-288.

KUMAR, M. (2007). Gaze-enhanced user interface design. PhD Thesis, Stanford University.

ROBERTS, T.L. AND MORAN, T.P. (1983). The evaluation of text editors: Methodology and empirical results.
Communications of the ACM, 26(4): 265-283.

ROSSON, M.B. (1984a). Characterizing freeform editing behavior. IBM Research Report RC 10550, IBM T. J. Watson
Research Center, Yorktown Heights, New York.

SULLIVAN, P. (1989). Human-computer interaction perspectives on word-processing issues. Computers and
Composition, 6(3): 11-33.
Eye tracking and speech recognition instead of a computer mouse

The aim of the study was to determine how effectively an eye-tracker and speech recognition can be used
instead of a mouse to select targets on a computer screen. The International Standards Organisation (ISO)
standard 9241-9 consists of six selection tasks. One of these tasks requires the user to use the mouse or an
alternative pointing device to select 16 targets in a specific order. Effectiveness is measured in terms of
the speed and accuracy with which the selections are made. In this way it can be determined whether targets
can be selected with the same effectiveness using eye tracking and speech recognition as with a mouse.
For each of the selection techniques, effectiveness was further investigated with respect to the size of the
target, the use of a gravity well, an electronic magnifying glass and visual feedback. A gravity well allows a
user to click slightly outside the target, after which the pointer is, as it were, pulled into the target. An
electronic magnifying glass enlarges the area directly under the pointer, and visual feedback entails drawing
a frame around the selected target. Each test participant performed the selection task with 14 different
combinations of factors. A balanced Latin square was used to determine the order of tests for each person so
that the effect of learning through experience was minimised.
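
As an illustration of the counterbalancing mentioned above, the sketch below builds a balanced Latin square
for an even number of conditions using the standard construction (first row 0, 1, n-1, 2, n-2, ..., with each
later row shifted by one). It is a minimal Python sketch and not the procedure documented for the study: the
function name is hypothetical, and the rule for assigning the twenty participants to the 14 rows (participant
p takes row p mod 14) is an assumption made only for the example.

```python
def balanced_latin_square(n):
    """Return n orderings of n conditions (n even) in which every condition appears once
    in every position and is immediately preceded by every other condition exactly once."""
    if n < 2 or n % 2:
        raise ValueError("this construction needs an even number of conditions")
    # First row alternates between the lowest and highest unused condition.
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    # Every later row is the first row shifted by one condition (modulo n).
    return [[(c + r) % n for c in first] for r in range(n)]


square = balanced_latin_square(14)

# Assumption for illustration only: participant p follows row p mod 14.
orders = [square[p % 14] for p in range(20)]
```

With 20 participants and 14 rows the first six rows are used twice under this assignment, so the balance
across participants is only approximate.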
Twenty students took part in the study, and participants were expected to be at least proficient with a mouse.
In addition to the selection task, which each participant had to perform in 14 different ways, each
participant also had to complete a questionnaire to obtain subjective feedback on each of the different test
variations.
Analysis of the data will determine whether the combination of eye tracking and speech recognition is
effective enough to serve as an alternative interaction technique for computer use.
 
 
Eye gaze and speech recognition instead of a computer mouse 
The combination of eye gaze and speech recognition as a selection technique was investigated using the 
ISO 9241-9 multi-directional tapping task. Twenty participants were tested on 14 conditions with varying target 
size, magnification capabilities and presence of a gravity well. Analysis of the data will determine whether this 
is a viable alternative to the mouse. 
  
The Usability of Speech and Eye Gaze as a Multimodal 
Interface for a Word Processor 
T.R. Beelders and P.J. Blignaut 
University of the Free State 
South Africa 
1. Introduction 
Communication between humans and computers is considered to be two-way communication between two 
powerful processors over a narrow bandwidth (Jacobs and Karn, 2003). Most interfaces today utilise more 
bandwidth with computer-to-user communication than vice versa, leading to a decidedly one-sided use of the 
available bandwidth (Jacobs and Karn, 2003). An additional communication mode will invariably provide for an 
improved interface (Jacobs, 1993) and new input devices which use passive measurements to capture data from 
the user both conveniently and at a high speed are well suited to provide more balance in the bandwidth 
disparity (Jacobs and Karn, 2003). In order to better utilise the bandwidth between human and computer, more 
natural communication which concentrates more on parallel and not sequential communication is required 
(Jacobs, 1993).  
 
Furthermore, the user interface is the connection between the user and the computer and as such plays a vital 
role in the success or failure of an application. Modern-day interfaces are entirely graphical and require users to 
visually acquire and manually manipulate objects on screen (Hatfield and Jenkins, 1997) and the current trend of 
Windows, Icons, Menu and Pointer (WIMP) interfaces has already been around since the 1970s (van Dam, 2001). 
Unlike their command line counterparts, these graphical user interfaces are not in the least accessible to users 
with disabilities and it has become essential that viable alternatives to mouse and keyboard input are found 
(Hatfield and Jenkins, 1997).  Specially designed applications which take users with disabilities into 
consideration are available but these do not necessarily compare with the more popular applications. This 
chapter therefore aims to investigate various ways to provide alternative means of input which could facilitate 
use of the mainstream product by disabled users. 
 
These alternative means should also enhance the user experience for novice, intermediate and expert users. 
Findings from previous studies (Beelders, 2006; Blignaut, Dednam and Beelders, 2007) show that while novice 
users of word processors experience a number of obstacles in acceptance and usage of the application that are 
unique to the demographic, alternative pictorial icons, text buttons and translation of the interface into the native 
language of the user all failed to lessen the learning curve significantly or to increase usability significantly. 
However, these findings should not discourage researchers but should serve as encouragement to find more 
innovative and creative means of alleviating the burden on these users. Particularly since these users show 
remarkable eagerness and enthusiasm to learn, greater effort should be made to accommodate them to become 
mainstream users. Although the main focus could be to narrow the gap between novice and expert users, the 
means to achieve this should not alienate or disrupt the smooth flow of work that an expert user is capable of 
achieving. Rather, the improvements should serve not only the novice users but also provide an alternative 
means for experts as a way to improve their interaction with the product. The study reported in this chapter
is therefore intended as an extension of the aforementioned studies and investigates further ways to improve
the interface of a word processor for all user groups.
 
The eye-tracker has steadily become more robust, reliable and affordable and therefore presents itself as a
suitable tool for this use (Jacobs and Karn, 2003). However, much research is still needed to determine the most
convenient and suitable means of interaction before the eye-tracker can be fully incorporated as a meaningful
input device (Jacobs and Karn, 2003). Moreover, the disadvantages associated with eye-tracking as an input
device mean that it should be used with caution or, as suggested by Istance, Spinner and Howarth (1996), it
should ideally be combined with other input modalities, such as speech, which will provide a means to overcome
the limitations of eye tracking. As it is, Microsoft Office already comes bundled with a built-in speech engine
which makes speech recognition available in all Office packages. There are also a number of affordable 
alternative speech engines available on the market. Eye-trackers may eventually become cost-effective enough to 
be a standard feature in future computing devices (Isokoski, 2000). However, given that the hardware and 
software is available, the task remains to prove that the eye-tracker improves the quality of human-computer 
interaction as validation for the inclusion in future devices (Isokoski, 2000). Although neither eye-tracking nor 
speech recognition is new to usability studies or as a potential source of increased usability, few studies have 
been found that use a combination of the two in a single package as a means of usability improvement. 
 
Therefore, the aim of this study was to determine whether a multimodal interface, using non-traditional input 
means could be created for a word processing application. In this way, this popular application can cater for a 
more diverse group of users through a highly customisable interface. The following section will provide some 
background literature which serves as a foundation on which this study was based. 
2. Background 
This section will discuss some of the available literature which was used as a foundation for the study. 
2.1 Advantages for users 
The high incidence of afflictions such as tendonitis, carpal tunnel syndrome and repetitive strain injuries 
provides ample motivation to reduce typing requirements and device manipulation (Klarlund, 2003). Automatic 
speech recognition (ASR) offers an interaction means capable of replacing conventional typing.  
 
Moreover, the most sensible way of empowering disabled users is to provide them with a means to be able to use 
the same software applications as any other computer user, which requires that input devices specifically tailored 
for these users will have to be developed (Istance, Spinner and Howarth, 1996). Eye movement is ideal for such 
situations as it requires no additional training, is high-speed and the majority of motor impaired individuals still 
retain ocular motor abilities (Istance, Spinner and Howarth, 1996).  
2.2 Eye-tracking and human-computer interaction  
Eye-tracking has been used as an alternative input means in a number of applications (for example Gips and 
Olivieri, 1996; Hornof, Cavender and Hoselton, 2004; Kumar, 2007). The use of eye-tracking can be facilitated in a 
number of ways, for example dwell time (Isokoski, 2000), look and shoot (Isokoski, 2000) or eye gestures. The use 
of dwell time requires the user to look at a target for a certain amount of time before the target is activated. 
Alternatively, look and shoot requires an additional mechanism to be triggered whilst gazing at the desired 
target. For example, the user may be required to press a key on the keyboard to activate the target under the eye 
gaze. Gaze gestures require the users to complete a predefined set of eye movements to activate a command 
(Drewes and Schmidt, 2007). Gaze gestures have been used to successfully map the entire alphabet, thereby 
allowing users to type text using only their eye gaze (Wobbrock, Rubinstein, Sawyer and Duchowski, 2008). All 
of these selection methods will be incorporated into the proposed multimodal interface to allow for maximum 
customisation of the interface to suit the needs of the user at any given time. 
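
The dwell-time method described above can be sketched as a small state machine that is fed the target under
the gaze at each sample and reports an activation once the gaze has rested on the same target for the set
interval. This is a minimal Python illustration under stated assumptions and not the code used in the study:
the class name, the 500 ms default and the rule that a target fires only once per continuous dwell are all
introduced here for the example.

```python
class DwellDetector:
    """Activate a target once the gaze has stayed on it for dwell_ms milliseconds."""

    def __init__(self, dwell_ms=500.0):
        self.dwell_ms = dwell_ms   # user-adjustable dwell interval (hypothetical default)
        self._target = None        # target currently under the gaze
        self._since = 0.0          # time at which the gaze arrived on that target
        self._fired = False        # True once this dwell has already triggered

    def update(self, target, t_ms):
        """Feed the target under the gaze at time t_ms; return it when it should be activated."""
        if target != self._target:
            # The gaze moved to another target (or to none): restart the timer.
            self._target, self._since, self._fired = target, t_ms, False
            return None
        if target is not None and not self._fired and t_ms - self._since >= self.dwell_ms:
            self._fired = True
            return target
        return None


detector = DwellDetector(dwell_ms=500)
samples = [("Bold", 0), ("Bold", 300), ("Bold", 500), ("Bold", 600),
           ("Italic", 700), ("Italic", 1100), (None, 1200),
           ("Italic", 1300), ("Italic", 1800)]
activations = [detector.update(target, t) for target, t in samples]
```

Look and shoot would replace the timer with an explicit trigger: the same target lookup is used, but
activation happens when the key press (or spoken command) arrives rather than when the interval elapses.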
 
The role of feedback is also vital in the development of eye gaze applications (Hyrskykari, Majaranta and Räihä, 
2003) and serves to increase the user efficiency and enjoyment (for example, Miniotas, Špakov and Evreinov, 
2003). Therefore, during this study visual feedback will always be given when eye gaze is used as an interaction 
technique. 
 
Furthermore, even with advances in technology and continued research, most interfaces which are gaze sensitive 
are designed with oversized interface elements to facilitate easier acquisition and activation of the element 
(Ashmore, Duchowski and Shoemaker, 2005). The use of oversize targets impacts negatively on screen real estate 
as a lot of free space is now occupied by icons, buttons etc. Both to counteract the impact on available screen
real estate and to exploit the properties of Fitts’ Law, several target expansion mechanisms have been proposed
and implemented for both eye pointing and manual input (Ashmore, Duchowski and Shoemaker, 2005). These 
include expansion of the target in motor space, expanding or zooming into the entire display uniformly or 
expanding a portion of the display through the use of a fisheye lens (Ashmore, Duchowski and Shoemaker, 
2005). Expansion of the targets can be either visible or invisible; invisible expansion occurs strictly in motor space, so 
the user is not aware of it. The idea behind invisible expansion is to create a larger selection area 
around the target without visual feedback. This allows room for error and slight displacement of the eye during 
target selection. Buttons used during this study for text input will be larger than the standard icons in Windows. 
Even so, invisible expansion of buttons will also be used for the onscreen keyboard. This invisible expansion will 
be referred to as a gravity well as the actual selectable area of the button will be larger than the physical size of 
the button. Once the eye gaze is detected within the bounds of the enlarged area of expansion, the button will 
become selectable, thus creating the impression that the eye gaze is drawn onto the button. Additional visible 
expansion capabilities, in the form of magnification triggered by the position of the eye gaze, will also be 
provided. 
2.3 Eye-tracking and speech recognition in combination 
The limitations created by the lack of accuracy of eye-tracking equipment can be overcome by the simultaneous 
use of speech recognition (Castellina, Corno and Pellegrino, 2008). Insofar as can be ascertained these particular 
modalities are often used in isolation. When used in such a manner, these are often ambiguous but when 
appropriately used in combination they could result in effective interaction methods (Oviatt, 1999). This would 
create a multimodal interface, which is an interface that uses several input and output modalities in combination 
in an effort to assist human-computer communication through utilising natural human communication channels 
(Pireddu, 2007) such as voice and gaze.  
 
The underlying foundation of this research undertaking is the view that while eye gaze and speech recognition 
are prone to ambiguity when used in isolation, using them in combination may allow many of the problems to be 
overcome. User intent can be inferred by providing a means for the user to gaze at certain objects and then issue 
verbal commands which can then be executed to create a hands-free application (Hatfield and Jenkins, 1997). In 
this way it is envisaged that the strengths of one interaction technique will be able to compensate for the 
weaknesses of the other and together speech and vision should provide a better interaction experience than each 
in isolation. Given the inherent problems associated with target selection via eye gaze, such as accuracy, stability 
and the Midas touch problem (everything the user gazes at is selected, as the user is not accustomed to an interface 
which reacts to eye gaze), it seems plausible that an additional modality might make selection easier and more 
feasible, even though to date very few empirical studies have been conducted to explore this phenomenon. 
One such study did determine that target selection using eye gaze and speech is accurate to such an 
extent that user performance approaches that of manual pointing (Miniotas, Špakov, Tugoi and MacKenzie, 
2006). Furthermore, integration of gaze and speech for multimodal interaction was shown to be a feasible 
option and an option that works well with robust eye trackers (Pireddu, 2007). 
 
EyeTalk is a voice and vision integrated application which allows a user to gaze at an object and issue a verbal 
command which is then captured and merged into a single message and passed to the current application as a 
mouse click or keyboard event (Hatfield and Jenkins, 1997). EyeTalk is application independent and can 
therefore be used with a multitude of standard applications. Users are able to fixate on an object, which causes 
the mouse cursor to move to that position, and then issue a command to execute a mouse click (Hatfield and 
Jenkins, 1997). Initial results with EyeTalk showed positive feedback and indicated that users were able to 
operate the system with high efficiency after just a few moments of getting accustomed to the system (Hatfield 
and Jenkins, 1997). A promising consequence of the EyeTalk application is the indication that a stand-alone 
application can be developed to interact with any Windows application without any need to re-engineer the 
entire existing application (Hatfield and Jenkins, 1997). 
3. Developed application 
The premise of the study that is reported in this chapter, namely to test the feasibility and usability of a multimodal 
interface for a word processor, necessitated that an application be developed for these purposes. Since Microsoft 
Word® enjoys the highest market penetration (Bergin, 2006) and also leads the way as the de facto interface 
standard, it was the focus of the study. Consequently, there were two options available: a complete application 
could be developed that emulated the look, feel and functionality of Word, or the Word application itself could be 
used with data capturing capabilities being provided. 
 
Since Visual Studio Tools for Office (VSTO) allows .NET developers not only to customise the interface of the Office 
suite but also to add functionality that is required (Anderson, 2009), it was decided to use the tried and 
tested application and add the required components. Therefore, VSTO was used to customise Microsoft Word 
to create a multimodal interface within a well-known environment. The integrated development environment 
(IDE) of Visual Studio 2008 was used for development with C# as the programming language. 
 
The Tobii Studio Software Development Kit (www.tobii.com) was used to add eye gaze functionality to the 
application and the Microsoft Speech Application Programming Interface (www.microsoft.com) was used to add 
speech capabilities. MagniGlass Pro® (http://magnifying-glass-pro.softutopia.com) was used for magnification 
purposes as it was fairly inexpensive and was the only tool that was found to allow interaction on the 
magnification itself. This means that the user could click on the magnified area and did not first have to close the 
magnification before being able to click, which defeats the purpose of using magnification for selection of small 
targets. 
 
Figure 1 shows the tab called “Multimodal Add-Ins” that was added to the ribbon in Word 2007.  The magnifier 
button allows the magnifying capabilities to be toggled on and off. Following this are the buttons to show and 
hide the onscreen keyboards. An alphabetic or standard QWERTY keyboard layout can be chosen. The onscreen 
keyboards are used for hands-free text entry using eye gaze and speech recognition. The next button group 
manages the speech engine. The speech engine can be turned on and off, a trained speech profile can be selected 
and automatic speech recognition (ASR) can be used for either command or dictation purposes. The final group 
manages the eye gaze interaction technique. The first step when using eye gaze is to calibrate the eye-tracker. The 
calibration process has a significant effect on the accuracy of the eye gaze interaction technique. The gaze type 
can then be set. Dwell time (linked to the sensitivity setting), blinking and look and shoot (with the Enter Key) 
are all available. When the “no activation mechanism” is chosen, then eye gaze can be used in combination with 
speech recognition. The gaze shape dropdown allows the user to select the shape of the visual feedback cue on 
the letters of the onscreen keyboard. 
 
Figure 1: Multimodal Add-Ins tab in Microsoft Word 
 
The editable region of the document is shown in the figure as a much smaller area than it was in reality. At 
the bottom of the screen, the onscreen QWERTY keyboard can be seen with the area directly under the current 
eye gaze being magnified. The yellow arrow indicates the exact position of the eye gaze. 
 
Speech recognition can be used for both dictation and command purposes. A simple grammar containing 
common formatting commands (for example bold, italic and underline), cursor movement (for example right, 
left, up and down) and text selection (for example, select a line, select a word, select whole document) commands 
was built. In this way it became possible to move around the document or select and manipulate text contained 
in the document without using either the mouse or the keyboard.  
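
A command grammar of this kind can be thought of as a lookup from recognised phrases to editor actions. The Python sketch below is a minimal illustration only; the exact phrases, the category and action names and the `interpret` function are hypothetical and are not taken from the grammar that was actually built for the application.

```python
from typing import Optional, Tuple

# Hypothetical command grammar: recognised phrase -> (category, action).
GRAMMAR = {
    # Formatting commands
    "bold": ("format", "bold"),
    "italic": ("format", "italic"),
    "underline": ("format", "underline"),
    # Cursor movement commands
    "right": ("move", "right"),
    "left": ("move", "left"),
    "up": ("move", "up"),
    "down": ("move", "down"),
    # Text selection commands
    "select line": ("select", "line"),
    "select word": ("select", "word"),
    "select whole document": ("select", "document"),
}


def interpret(utterance: str) -> Optional[Tuple[str, str]]:
    """Return (category, action) for a recognised command, or None if it is out of grammar."""
    normalised = " ".join(utterance.lower().split())  # ignore case and extra spaces
    return GRAMMAR.get(normalised)
```

Keeping the grammar small in this way restricts the recogniser to a handful of phrases, at the cost of covering only basic operations.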
 
The dwell time can be set by the user to a length of time with which they are comfortable. Blinking requires the 
user to blink in order to activate the object currently being fixated on. Since blinking is a natural occurrence, the 
blink required for this activation must be more pronounced. Finally, eye gaze can be used in combination with 
speech recognition as a text entry method using an onscreen keyboard. When the eye gaze is stable and directed 
at a certain key, the key is framed with a green square, or the selected shape (see Figure 2). This gives a visual 
cue/feedback to the user so that they know the key can now be activated. The user can then issue one of several 
verbal commands in order to type the selected letter to the document at the cursor position. The keys of the 
onscreen keyboard had a gravity well of 20 pixels on all sides. 
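
The effect of the gravity well on key selection can be sketched as a hit test against an invisibly enlarged rectangle. The Python below is an illustrative assumption rather than the application's code: the `Key` class and `key_under_gaze` function are hypothetical, and the 60-pixel key size and spacing are the values reported later for the typing tasks.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Key:
    label: str
    left: int        # x coordinate of the visible left edge, in pixels
    top: int         # y coordinate of the visible top edge, in pixels
    size: int = 60   # visible width and height, in pixels

    def hit(self, gx: float, gy: float, well: int = 20) -> bool:
        """True if the gaze point lies within the key enlarged by `well` pixels on all sides."""
        return (self.left - well <= gx < self.left + self.size + well
                and self.top - well <= gy < self.top + self.size + well)


def key_under_gaze(keys: Iterable[Key], gx: float, gy: float,
                   well: int = 20) -> Optional[Key]:
    """Return the first key whose enlarged area contains the gaze point, if any."""
    for key in keys:
        if key.hit(gx, gy, well):
            return key
    return None
```

With keys spaced 60 pixels apart and a well of 20 pixels, the enlarged areas of neighbouring keys do not overlap, so returning the first match is sufficient in this sketch; a denser layout would need a tie-breaking rule.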
 
 
Figure 2: Onscreen keyboard framed in green when selected 
 
By providing all these functions and settings, a highly customisable interface was built within the well-known 
environment of Word. 
 
4. User testing  
The scope of the project did not allow full-scale user testing to be conducted on all the interaction techniques, 
such as dwell time and blinking. Therefore, the user testing only concentrated on testing the combination of eye 
gaze and speech when used in a word processor. These interaction techniques could be used for two specific 
purposes, namely to issue commands in order to perform basic word processing tasks and to enter text within 
the document. These two types of tasks will be reported on separately within this chapter. 
 
Longitudinal testing was conducted over a ten week period with each participant attending one session per week 
at the same time and on the same day. During the first session, participants each trained their speech profile 
using the Microsoft speech training wizard. The participants were then introduced to the multimodal Word that 
they would be using for the next few weeks and were given a brief tutorial of the speech grammar which was 
available for use in Word. The participants were then encouraged to interact with the application and to use all 
the verbal commands as well as attempting to type a full sentence using the onscreen keyboard and the 
interaction technique of eye gaze and speech. Every subsequent session followed the same procedure, which was 
to complete the list of preset tasks as quickly and correctly as possible.  
 
4.1 User testing of speech commands 
The use of speech commands and how their performance compares with that of the mouse and keyboard will be 
investigated first. 
4.1.1 Participants 
In total there were 25 participants who participated in the longitudinal study. They were all undergraduate 
students who were completing their studies at the University of the Free State, South Africa. A pre-requisite for 
participation in the study was sufficient computer literacy as well as word processor expertise.  
 
There were 17 male participants and 8 female participants with an average age of 21.1 (standard deviation = 1.9). 
Six participants indicated that English was their first language, 7 Afrikaans and the remainder (12) were African 
language speakers. Since the University employs a parallel medium tuition policy where classes are offered in 
either English or Afrikaans, all students are comfortable in either English or Afrikaans. Therefore, each session 
was conducted in the tuition language of the participant. 
4.1.2 Tasks 
Participants had to complete 20 tasks, five of which were typing tasks. The majority of the other tasks, for 
example selection and formatting, had to be completed using the traditional means of a mouse or keyboard. A 
similar task then had to be repeated using speech recognition. The tasks were set up in such a way that tasks of the same 
type required approximately the same minimum number of actions to complete successfully. A summary of 
the tasks is tabulated below (with typing tasks omitted): 
  
Task description | Shortened task description | Keyboard | Speech
Select three lines and apply formatting such as bold or italics | Line selection and formatting | 1 | 1
Select all text in the document and remove it by deleting or cutting | Select all text and remove | 1 | 1
Select two words and make them bold | Select words and format | 1 | 1
Paste previously copied text at the current cursor position | Paste | 1 | 1
Undo the previous action | Undo | 1 | 1
Select a single word and copy it | Select word and copy | 1 | 1
Position the cursor at a certain position in the document and paste the previously copied text | Position and paste | 1 | 1
Table 1: Grouped tasks as divided between interaction techniques 
4.1.3 Measurements 
The measurements that will be analysed are the time taken to complete the task as well as the number of actions 
that were required to complete the task. The number of errors was also considered as a means to determine how 
effective the interaction technique is. However, since there are multiple ways to complete a task, it became very 
difficult to pinpoint exactly what was an erroneous action, particularly where the mouse or keyboard was used. 
For the speech, the commands that could complete the task could be isolated as an acceptable set of commands 
for that task and then any command issued that is not a member of that set can be flagged as an error command. 
However, since there is considerable risk of flagging an action as an error when it might not be, it 
was decided that the percentage of the task completed correctly was a better indicator of the effectiveness of the 
interaction techniques.  
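
The set-membership approach to flagging speech errors that was considered above can be sketched in a few lines of Python. The function name and the example command sets are hypothetical; as noted, this approach was not adopted for the analysis.

```python
from typing import Iterable, List, Set


def flag_error_commands(issued: Iterable[str], acceptable: Set[str]) -> List[str]:
    """Return the issued commands that are not in the acceptable set for the task."""
    return [command for command in issued if command not in acceptable]


# Hypothetical example for a "select word and make it bold" task.
errors = flag_error_commands(["select word", "right", "bold"],
                             acceptable={"select word", "bold"})
```

In this example the command ‘right’ would be flagged, even though it might have been a legitimate step towards positioning the cursor, which illustrates the risk of false flags.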
4.1.4 Time to complete a task 
The time to complete the task was measured from when the task was started to when the task was considered by 
the participant to be completed. This time included the time it took the participant to read the description of the 
task. Since similar tasks had virtually identical wording it was assumed that they would require the same 
amount of time to read and that, therefore, the time to read would not have an effect on the time required to 
complete the task.  
 
The charts below (Figures 3-6) plot the least squares means for both interaction techniques over all sessions. The 
least squares means are the means of interest when interpreting significant results of a factorial design (StatSoft, 
2010) and will therefore be provided as a visual representation of the descriptive statistics. The vertical bars 
denote a 95% confidence interval. The blue line plots the completion time for the speech and the red line that of 
the keyboard. 
 
   
Figure 3: Average completion times for (a) line selection and formatting and (b) select all and remove 
 
  
Figure 4: Average completion times for (a) select words and format and (b) paste 
 
  
Figure 5: Average completion times for (a) undo and (b) select word and copy 
 
  
Figure 6: Average completion times for position and paste 
 
As can clearly be seen from the graphs above, in some instances the keyboard maintained a faster average 
completion time and in others the speech interaction technique could surpass the performance of the keyboard. 
 
The time measurements were in seconds and there were a vast number of instances in which the normality tests 
failed for the data. To address this, the time measurement was transformed to 1/time.  
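
The reciprocal transformation can be illustrated with a short Python sketch. The function name and the sample values are hypothetical and serve only to show the direction of the transformation.

```python
from typing import Iterable, List


def reciprocal_transform(times_s: Iterable[float]) -> List[float]:
    """Convert completion times in seconds to 1/time (tasks per second)."""
    return [1.0 / t for t in times_s]


# Hypothetical completion times in seconds, including one slow outlying trial.
speeds = reciprocal_transform([4.0, 5.0, 8.0, 40.0])
```

Note that the transformation reverses the ordering: a larger transformed value corresponds to a faster completion time, and very long times are compressed towards zero.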
 
For each of the tasks, the following hypotheses were formulated:  
1. H0,1: There is no difference between the time required to complete the tasks when using the mouse and 
keyboard or speech commands. 
2. H0,2: Participants did not improve over time with regard to the time taken to complete the tasks. 
 
A repeated-measures within-subjects ANOVA was performed to analyse the aforementioned hypotheses. Where 
the assumption of sphericity was not met, the Greenhouse-Geisser and Huynh-Feldt corrections were applied to the 
degrees of freedom. The table below shows only the results of 
the original ANOVAs and not, for the sake of brevity, the results of the adjusted corrections. For the paste task, 
there was significant interaction between the factors of interaction technique (keyboard and speech) and 
improvement over time (session), so the two hypotheses had to be examined in isolation and no results are reported for it in the table. 
 
Task | H0,1 | H0,2
Line selection and formatting | F(1, 23) = 0.286, p > 0.05 | F(8, 184) = 14.040, p < 0.05
Select all and remove | F(1, 23) = 4.328, p < 0.05 | F(8, 184) = 15.197, p < 0.05*
Select words and format | F(1, 26) = 10.447, p < 0.05 | F(8, 208) = 9.487, p < 0.05
Paste | - | -
Undo | F(1, 24) = 0.001, p > 0.05 | F(8, 192) = 22.148, p < 0.05
Select word and copy | F(1, 22) = 3.655, p > 0.05 | F(8, 176) = 3.470, p < 0.05
Position and paste | F(1, 22) = 15.448, p < 0.05 | F(8, 176) = 5.123, p < 0.05
Table 2: Results of ANOVA for time of speech commands 
 
The first null hypothesis could be rejected for the task which required all text to be selected and removed. In this 
instance, it was the speech commands which averaged a faster completion time. Conversely, the keyboard was 
significantly faster for the task where words had to be selected and formatted as well as for the position and 
paste task. This finding could imply that the speech command to select all text was fairly intuitive and easy to 
learn, which facilitated a faster completion time than using the mouse or keyboard. However, selection of 
individual words was less intuitive and took longer than when using the keyboard or mouse. It could also mean 
that participants did not use the keyboard shortcut to select all text, even though this is the fastest way of selecting all text in 
a document. Analysis of the number of actions should provide more clarity in this regard. 
 
For those tasks where the second null hypothesis could be rejected, it was in the majority of cases the first 
few sessions which differed significantly from the last sessions. This very encouraging finding indicates that 
a significant learning effect occurs as the amount of exposure to the application is increased. 
 
When a repeated-measures within-subjects ANOVA was performed for the paste task, it was found that there 
was significant interaction between the two factors of session and interaction technique (F(8, 192) = 2.356, p < 
0.05). Therefore, it was imperative that each factor was isolated and analysed separately to preclude the 
interaction with the other factor having an effect on the analysis. Firstly, H0,1 was evaluated by isolating each 
session individually and testing for a difference between interaction techniques. For brevity’s sake, the actual 
results of the ANOVA will not be reported here. Suffice it to say that, at an α-level of 0.05, there was a significant 
difference between the interaction techniques in every session. Therefore, the completion time is significantly 
better for speech than for the keyboard and mouse throughout all the sessions. Secondly, H0,2 was evaluated 
using a repeated-measures within-subject ANOVA but testing each interaction technique separately. 
Consequently, it was found that H0,2 could be rejected for both the speech interaction technique (F(8, 96) = 17.727, 
p < 0.05) and the keyboard and mouse (F(8, 96) = 6.883, p < 0.05).  
4.1.5 Number of actions 
The next measurement to be analysed was the number of actions that were performed during task completion. 
Actions were defined as any mouse click, button press or speech command that was issued during completion of 
the task. The number of actions was measured per interaction technique and per session for each participant 
and then, as always, outliers were removed from the data set prior to analysis. 
  
The following hypotheses were formulated to analyse the actions for each task: 
H0,1: The interaction technique does not significantly affect the number of actions required to complete the task. 
H0,2: Participants did not improve over time with regard to the number of actions required to complete the task. 
 
The charts below (Figures 7-10) plot the number of actions for each interaction technique over all sessions. The 
red line plots the keyboard and mouse actions, while the blue plots the speech commands. 
 
  
Figure 7: Average number of actions for (a) line selection and formatting and (b) select all and remove 
 
 
  
Figure 8: Average number of actions for (a) select words and format and (b) paste 
 
  
Figure 9: Average number of actions for (a) undo and (b) select word and copy 
  
Figure 10: Average number of actions for position and paste 
 
The graphs clearly show that in most instances the use of the keyboard and mouse resulted in more actions being 
performed. It was only when participants were required to position the cursor and paste previously copied text 
that the speech commands required more actions. The table below summarises the results of the repeated-
measures within-subjects ANOVA for each task.  
 
Task | H0,1 | H0,2
Line selection and formatting | - | -
Select all and remove | F(1, 18) = 8.574, p < 0.05 | F(8, 144) = 2.562, p < 0.05
Select words and format | F(1, 23) = 2.598, p > 0.05 | F(8, 184) = 2.234, p < 0.05
Paste | F(1, 15) = 6.287, p < 0.05 | F(8, 120) = 1.297, p > 0.05
Undo | F(1, 24) = 2.294, p > 0.05 | F(8, 192) = 2.934, p < 0.05
Select word and copy | F(1, 19) = 3.498, p > 0.05 | F(8, 152) = 1.378, p > 0.05
Position and paste | - | -
Table 3: Results of ANOVA for actions of speech commands 
 
In the two instances where there was a significant difference between the interaction techniques, it was the 
speech commands which required significantly fewer actions than the keyboard. This result for the selection and 
removal of all text and the paste task corresponds with the findings that the speech commands were also more 
efficient, in terms of the time required to complete a task, for these tasks. 
 
For the task which requires that words be selected and formatted, session 2 had a significantly higher number of 
actions than any other session. During the undo task, session 3 resulted in a significantly larger number of 
actions than the other sessions. 
 
The two tasks for which there are no results in the above table had significant interaction between the two 
factors. This meant that individual analyses had to be performed in order to counteract the effect of one factor on 
another. For the line selection and formatting task, the two interaction techniques differed significantly from one 
another during the second and eighth session. During the other sessions the number of actions for the two 
interaction techniques was comparable to one another. The second null hypothesis could be rejected for the 
keyboard, where a significantly higher number of actions were performed during session 2 than all the other 
sessions, but not for the speech commands. Closer inspection of the analysis revealed that some participants 
resorted to using longer methods of text selection when using the keyboard. For example, they would select the 
text one character at a time instead of using the efficient means which were available. Since it appears that the 
majority of the participants used the mouse for selection purposes, the fact that there was a minority who 
employed this very inefficient means was not cause for great concern but cognisance was taken thereof. 
 
For the task where the cursor had to be positioned and text pasted at that specific location, speech required 
significantly more actions than the keyboard during all the sessions. Even though the number of actions 
decreased over the sessions, which indicates learning, the learning did not allow the speech to perform on a 
comparable level to the keyboard. The higher number of actions for the speech interaction technique could be 
explained by the types of commands that were issued. Therefore, an analysis was conducted to determine which 
commands were issued during the completion of this task. This showed a high incidence of the command ‘Right’ 
which could be used to move the cursor to the right. This indicated that the participants resorted to moving the 
cursor to the correct position one character at a time. Evidently, very few participants realised that they could use 
the command ‘Select word’ and then ‘Right’ to move the cursor to the right a word at a time. Since the keyboard 
and mouse offer the alternative of simply clicking the mouse pointer at the correct position, this could account 
for the significant difference between the two interaction techniques. This finding could mean that the 
participants do not seek to find the most efficient method of task completion.  
 
The ANOVA performed to evaluate H0,2 for the speech commands showed that there was a significant difference 
between the sessions (F(8, 64) = 5.820, p < 0.05*). Post-hoc tests indicated that there was significant improvement 
between session 2 and the remainder of the sessions. 
4.1.5 Discussion 
The speech interaction technique performed relatively well when compared with the keyboard and mouse, in 
some instances even surpassing the performance of the traditional input methods. Clearing of all text in the 
document and pasting were even faster and completed with fewer actions than when using the keyboard and 
mouse. It is only when positioning within the document must occur that the keyboard outperforms the speech 
interaction technique in terms of both the time that it takes and the number of commands that are issued. 
 
While this finding was very encouraging, the most promising finding was that there was continued 
improvement in the efficiency with which the task was completed. Even though the improvement between 
subsequent sessions was not always significant, the fact that there is continual improvement hints at the possibility 
that the two interaction techniques could eventually compete on a comparable level for all tasks or that the 
speech interaction technique could eventually perform better.  
 
Since there are often multiple options available to the user to complete the task when using the traditional means, 
the most effective method was not always chosen. This was also noticed when using speech to move the cursor. 
Rather, the user chooses the method which results in an intermediate action which is closer to the final result, 
when in reality there is a shorter method that can be used. 
 
The fact that the speech commands resulted in fewer actions for most of the tasks may be attributed to the fact that 
the grammar was fairly simple and provided commands to complete basic operations only. The complexity of 
the options provided by Word is much higher than accommodated in the grammar. When using Word in the 
normal capacity there are, more often than not, at least three different ways to complete a task, which may place an 
added burden on the user of the application. However, the goal of the study was not to provide a complete 
alternative to the keyboard and mouse but rather to determine whether common word processing tasks could be 
achieved using an alternative interaction technique. Therefore, by the very nature of the study, the grammar was 
required to be simple in composition.  
4.1.6 Further research 
The tasks that were chosen for this part of the study were chosen as some of the more common tasks that may 
occur in the word processing application. Therefore, they may be viewed as some of the less complex tasks and 
other tasks may require less intuitive and more complex commands. However, this would mirror the 
nature of any other system which provides access to common tasks “at your fingertips”, for example the Home 
tab in Office, while lesser used tasks or more complex tasks require further navigation and perhaps a heavier 
burden on one’s memory. It may be possible to extend the grammar to encompass many more tasks within the 
word processor application. Another consideration would be to use a default smaller grammar and an optional 
extended grammar that can be activated on request. 
 
The results of the study indicate that interaction through speech could dramatically increase the efficiency of 
end-users. However, it remains to be seen if this result holds when the user is free to use the grammar in a 
normal setting. This would require that the participants would not be given small separate tasks but rather that 
they would have to compile a document from scratch with pre-defined formatting. 
 
Whether or not an extended grammar is considered, further research will have to be done where the exposure to 
the application is lengthened in order to determine whether the learning effect can continue to an even greater 
degree. This study could use a smaller sample as it has already been established that it is possible to use this 
interaction technique effectively. 
 
4.2 User testing of text input 
As previously mentioned, the longitudinal testing also included tasks which required that the participants input 
text using either the keyboard or eye gaze and speech recognition. This section is a discussion of the comparative 
study between these two text input methods. 
4.2.1 Participants 
The participants for this analysis were the same as in the previous section. There were, however, three of the 25 
participants who were unable to type using eye gaze and speech for various reasons and they were excluded 
from the analysis. Fourteen of the remaining participants were male and 8 were female, 6 were English-speaking, 
6 Afrikaans-speaking and the remainder (10) had an African language as their first language. The average age of 
participants was 21.1 (standard deviation = 2.0). 
4.2.2 Tasks 
In total there were two typing tasks using the keyboard and three using the eye gaze and speech.  The tasks 
required participants to type phrases that were randomly selected from a set of 35 preselected phrases, which were 
in turn selected from the 500 everyday commonly used phrases as determined by MacKenzie and Soukoreff 
(2003). 
 
When using eye gaze and speech the size of the buttons was set to 60×60 pixels (≈1.55° visual angle). Buttons 
were spaced 60 pixels apart with a gravity well of 20 pixels on all sides of each button. Although there were three 
typing tasks using these settings, only the last two of each session were included in the analysis. This was due to 
the fact that the first one was viewed more as a practice typing task to reacclimatise the participants to typing 
using eye gaze and speech. The participants were not told that the first task would not count towards the 
analysis and were instructed to complete all tasks to the best of their ability.  
 
In order to investigate the effect of the size of, and spacing between, targets, additional typing tasks were added 
from the fifth session onwards. Of these additional typing tasks, the first had to be completed using the 
originally sized and spaced buttons. The next two had to be completed with buttons that were 50×50 pixels in 
size (≈1.29° visual angle at a viewing distance of 600 mm) and spaced 70 pixels apart. These were followed by 
another two tasks which had to be completed using buttons that were also 50×50 pixels in size but were spaced 
60 pixels apart. For all typing tasks a gravity well of 20 pixels on all sides of the buttons was employed. 
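
The relationship between button size in pixels, visual angle and the selectable area created by the gravity well 
can be illustrated with a short sketch. The viewing distance of 600 mm and the 20-pixel gravity well are taken 
from the text; the pixel pitch is an assumption, chosen only because it reproduces the reported angles, and the 
function names are hypothetical rather than part of the study's software. 

```python
import math

VIEWING_DISTANCE_MM = 600.0  # stated in the text
PIXEL_PITCH_MM = 0.2705      # ASSUMPTION: not reported; chosen to match the stated angles
GRAVITY_WELL_PX = 20         # stated in the text (on all sides of each button)


def visual_angle_deg(size_px, pixel_pitch_mm=PIXEL_PITCH_MM,
                     distance_mm=VIEWING_DISTANCE_MM):
    """Visual angle (in degrees) subtended by a target of size_px pixels."""
    size_mm = size_px * pixel_pitch_mm
    return math.degrees(2 * math.atan(size_mm / (2 * distance_mm)))


def selectable_size_px(button_px, gravity_well_px=GRAVITY_WELL_PX):
    """Side length of the area in which a gaze point is attracted to the button."""
    return button_px + 2 * gravity_well_px


for button in (60, 50):
    print(button, round(visual_angle_deg(button), 2), selectable_size_px(button))
```

Under these assumptions a 60-pixel button has a selectable side of 100 pixels and a 50-pixel button a selectable 
side of 90 pixels, so the gravity well enlarges the effective target considerably in both cases. 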
 
4.2.3 Measurements 
Since both input methods (the keyboard and eye gaze and speech recognition) were character based, the 
measurements that were selected for analysis were the character error rate and the characters typed per second. 
The character error rate (CER) measures how many insertions, deletions and substitutions have to be done to 
convert the presented text to the text as entered by the participant (Read, 2005). This measurement is 
synonymous with the Levenshtein distance between two strings (Levenshtein, 1966) divided by the number of 
characters that were typed (Read, 2005; MacKenzie and Soukoreff, 2002). This error rate measurement will be 
used in this section to analyse the effectiveness of the interaction techniques. 
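
As an illustration of this measurement, the following minimal sketch computes the Levenshtein distance with 
the standard dynamic programming approach and divides it by the number of characters typed, as described 
above. The function names are hypothetical and the code is not the analysis software used in the study. 

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions
    needed to convert source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,           # delete s_char
                               current[j - 1] + 1,        # insert t_char
                               previous[j - 1] + cost))   # substitute or match
        previous = current
    return previous[-1]


def character_error_rate(presented, transcribed):
    """Levenshtein distance divided by the number of characters typed."""
    if not transcribed:
        raise ValueError("no characters were typed")
    return levenshtein(presented, transcribed) / len(transcribed)
```

For example, if the presented phrase is "the quick brown fox" and the participant types "the quikc brown fox", 
two edits are needed and 19 characters were typed, giving an error rate of 2/19. 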
 
For the efficiency of the interaction techniques, the measurement of characters per second (CPS) will be used. 
This measurement divides the number of characters that were typed by the time taken in seconds. Similar to 
previous studies (MacKenzie, 2002), the time taken was measured from the time when the first character was 
typed to the time the last character was typed. This excludes the time required to read the question, including the 
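sentence that must be typed, and the time taken to locate the first character that must be typed. Because timing 
starts only once the first character has been typed, that character falls outside the measured interval and the 
number of characters used in the calculation becomes n-1, where n is the number of characters typed. 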
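
A minimal sketch of this calculation is given below. It assumes that a timestamp (in seconds) is available for 
the first and last character typed; the function and parameter names are hypothetical. 

```python
def characters_per_second(transcribed, first_char_time, last_char_time):
    """Typing speed measured from the first to the last character typed.

    The first character marks the start of timing, so only n - 1
    characters are entered during the measured interval.
    """
    elapsed = last_char_time - first_char_time
    if len(transcribed) < 2 or elapsed <= 0:
        raise ValueError("at least two characters and a positive interval are required")
    return (len(transcribed) - 1) / elapsed
```

For a five-character entry whose first and last characters were typed 2 seconds apart, this gives 
(5 - 1) / 2 = 2 characters per second. 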
4.2.4 Results 
The initial analysis will only include the data from the original typing tasks using the originally sized buttons.  
 
The leftmost chart below shows the average error rate for input through eye gaze and speech (blue line) and the 
keyboard (red line). The chart on the right shows the characters per second that were achieved with both 
interaction techniques and for all sessions. Clearly, the technique of eye gaze and speech results in far more 
errors than the keyboard when used for text entry while the keyboard facilitates a faster typing speed. Although 
the error rate of eye gaze and speech declines as exposure increases, the typing speed does not increase 
significantly. This could indicate that either more practice is required to increase typing speeds or that the typing 
speed quickly reaches a plateau which cannot be breached. Observation of the participants during their 
interaction with the system would suggest that more practice is required to increase the efficiency of the text 
entry. 
 
  
Figure 11: Least squares mean plot of character error rate and characters per second 
 
At a 95% confidence level, it was found that the interaction technique had a significant effect on the 
number of errors made (F(1, 21) = 6.516, p < 0.05) and that there was also a significant difference between the 
sessions (F(8, 168) = 2.278, p < 0.05). In particular, sessions 9 and 10 differed significantly from sessions 2 and 3. 
This shows a measure of improvement in the error rate over time and suggests that participants 
were becoming more accustomed to using eye gaze and speech for text input purposes. 
 
Similarly, the interaction technique (F(1, 21) = 54.704, p < 0.05) had a significant effect on the characters typed per 
second but there was no significant difference between the sessions (F(8, 168) = 1.385, p > 0.05). Therefore, using 
eye gaze and speech for typing is significantly slower than when typing with the keyboard but there is no 
significant improvement in typing speed as exposure to the system increases. 
 
The next step was to analyse text input that includes the additional tasks and differently sized and spaced 
buttons. Since the additional tasks were only completed from session 5 onwards, the analysis was done for 
these sessions only. In order to distinguish between the different sized buttons, results for the originally sized 
and spaced buttons will be referred to as speech-L, the smaller widely spaced buttons as speech-SW and the 
smaller closely spaced buttons as speech-SC. 
 
The graphs below plot the error rate and characters per second for each of the text entry methods for the sessions 
during which they were tested. 
 
 
 
Figure 12: Least squares mean plot of character error rate and characters per second for all typing tasks 
 
The keyboard has the lowest error rate of all the interaction techniques and it also has the highest typing speed. 
Regarding the error rate and typing speed of the eye gaze and speech, the three different methods are virtually 
indistinguishable from one another. 
 
The interaction technique (F(3, 44) = 4.100, p < 0.05) causes a significant difference in the error rate but there is no 
significant difference between the error rates of the various sessions (F(5, 220) = 1.056, p > 0.05). Post-hoc tests 
indicate that there is a significant difference between the error rates of the keyboard and those of the speech-SW 
interaction technique. In terms of typing speed, the interaction technique (F(3, 44) = 148.369, p < 0.05*) 
significantly affects this measurement as does the session (F(5, 15) = 3.002, p < 0.05*). As could be expected the 
keyboard results in a significantly faster typing speed than all other interaction techniques. The typing speeds in 
the last session were also significantly faster than the speeds of the first two sessions which indicates some 
measure of learning. 
4.2.5 Discussion 
It was found that the eye gaze and speech interaction technique causes a significantly higher error rate than the 
keyboard. There was no difference between the error rates of speech-L, speech-SW and speech-SC and they all 
differed from the keyboard at some stage. However, the interaction technique of speech-L did seem to offer the 
most improved error rate as it did not differ from the keyboard when analysed for the later sessions only. In 
some instances there was improvement over the sessions, which indicates some measure of learning when using 
eye gaze and speech. If the learning effect can be maintained, more practice could possibly lead to an 
effectiveness measurement which is comparable to that of the keyboard. 
 
In terms of efficiency (characters per second), the keyboard outperformed the eye gaze and speech interaction 
technique. The efficiency of eye gaze and speech also did not improve as exposure increased. This could either 
indicate that more practice is needed to achieve increased speed or that the typing speed quickly reaches the 
fastest achievable rate. Neither the size of the buttons nor the spacing between buttons affected the efficiency of 
the eye gaze and speech. 
4.2.6 Further research 
Further research can be conducted whereby the participants receive more practice with using eye gaze and 
speech as a text input mechanism. This will allow more detailed analysis to be performed in order to determine 
whether a much longer period of exposure would serve to increase the effectiveness and efficiency of the 
interaction technique. Furthermore, future studies could incorporate the correction of errors so that the character 
error rate could be used to assess both the eventual correctness of the transcribed text and its correctness 
before corrections were applied. 
 
Since it was found that neither the size of the buttons nor the spacing between the buttons influenced the 
usability of the interaction technique, further tests can be conducted to determine whether an increase in the 
gravity well will impact performance. Although the decrease of physical size and increase of gravity well result 
in a selectable area with the same size as a large button, the perceived accuracy with smaller buttons could serve to 
boost the confidence, and therefore satisfaction, of end-users. 
5. Conclusion 
This chapter reported on the results of similar word processing tasks which were compared when they were 
completed using the mouse and keyboard or when using speech commands. The measurements which were 
analysed were time to complete the task and the number of actions that were performed during completion of 
the task. For the majority of the tasks it was found that the interaction techniques could compete on a comparable 
level, particularly as the participant gained experience. This indicates that the application was indeed learnable.  
These results indicate that the proposed use of speech commands within a word processor application is viable.  
 
This chapter also reported on the results of the use of eye gaze and speech for text input when compared to a 
traditional keyboard. Measurements of effectiveness, namely the error rate, and efficiency, namely characters 
typed per second, were analysed. It was found that when using eye gaze and speech for text input, neither the 
size of the buttons nor the spacing between the buttons affected the performance of the interaction technique. 
The performance of the keyboard for both these usability measures far outstrips that of the eye gaze and speech. 
Even with extended exposure to the eye gaze and speech interaction techniques, the effectiveness and efficiency 
could not reach levels which were equivalent to those achieved by the keyboard. 
6. References 
Ashmore, M., Duchowski, A.T. & Shoemaker, G. (2005). Efficient Eye Pointing with a Fisheye Lens. In 
Proceedings of Graphics Interface 2005 
Beelders, T.R. (2006). A comparative study on users’ responses to graphics, text and language in a word 
processor interface. M.Sc dissertation, University of the Free State, Bloemfontein, South Africa 
Bergin, T.J. (2006). The Origins of Word Processing Software for Personal Computers: 1976 – 1985. IEEE Annals of 
the History of Computing. 28(4), pp. 32-47 
Blignaut, P.J., Dednam, E.H. & Beelders, T.R. (2007). Die opleiding van persone uit benadeelde groepe in 
rekenaargebruik: Is die agterstand nie té groot om te oorbrug nie? Suid-Afrikaanse Tydskrif vir Natuurwetenskap en 
Tegnologie, 26(3) 
Castellina, E., Corno, F., & Pellegrino, P. (2008). Integrated Speech and Gaze Control for Realistic Desktop 
Environments. In Proceedings of ETRA 2008 
Drewes, H. & Schmidt, A. (2007). Interacting with the Computer using Gaze Gestures. In Proceedings of the 11th 
IFIP TC13 International Conference on Human-Computer Interaction, INTERACT 2007, Rio de Janeiro, Brazil, September 
2007 
Gips, J. & Olivieri, P. (1996). EagleEyes: An Eye Control System for Persons with Disabilities. In Proceedings of The 
Eleventh International Conference on Technology and Persons with Disabilities, Los Angeles, March 1996 
Hatfield, F. & Jenkins, E.A. (1997). An interface integrating eye gaze and voice recognition for hands-free 
computer access. In Proceedings of the CSUN 1997 Conference 
Hornof, A., Cavender, A. & Hoselton, R. (2004). EyeDraw: A system for drawing pictures with eye movements. 
In Proceedings of ASSETS 2004 
Hyrskykari, A., Majaranta, P. & Räihä, K-J. (2003). Proactive response to eye movements. In M. Rauterberg et al. 
(Eds.), Human-Computer Interaction -- INTERACT'03, IOS Press, pp. 129-136 
Isokoski, P. (2000). Text input methods for eye trackers using off-screen targets. In Proceedings of ETRA 2000 
Istance, H.O., Spinner, C. & Howarth, P.A. (1996). Providing motor impaired users with access to standard 
Graphical User Interface (GUI) software via eye-based interaction. In Proceedings of the 1st European Conference on 
Disability, Virtual Reality and Associated Technology, Maidenhead, UK 
Jacob, R.J.K. (1993). Eye Movement-Based Human-Computer Interaction Techniques: Toward Non-Command 
Interfaces. In H.R. Hartson and D. Hix (Eds.), Advances in Human-Computer Interaction, Vol. 4, pp. 151-190. Ablex 
Publishing Co 
Jacob, R.J.K. & Karn, K.S. (2003). “Eye Tracking in Human-Computer Interaction and Usability Research: Ready 
to Deliver the Promises (Section Commentary),” in J. Hyona, R. Radach, and H. Deubel (eds.), The Mind's Eye: 
Cognitive and Applied Aspects of Eye Movement Research, pp. 573-605, Amsterdam, Elsevier Science 
Klarlund, N. (2003). Editing by Voice and the Role of Sequential Symbol Systems for Improved Human-to-
Computer Information Rates. In Proceedings of ICASSP 
Kumar, M. (2007). Gaze-enhanced user interface design. PhD Thesis, Stanford University. 
Miniotas, D., Špakov, O. & Evreinov, G. (2003). Symbol Creator: An alternative eye-based text entry technique with low 
demand for screen space. In M. Rauterberg et al. (Eds.) Human Computer Interaction – INTERACT ’03, pp. 137-143 
Miniotas, D., Špakov, O., Tugoy, I. & MacKenzie, I.S. (2006). Speech-Augmented Eye Gaze Interaction with Small 
Closely Spaced Targets. In Proceedings of the 2006 symposium on Eye tracking research and applications (ETRA), San 
Diego, California, pp. 67-72 
Oviatt, S. (1999). Mutual disambiguation of recognition errors in a multimodal architecture. In Proceedings of the 
ACM SIGCHI 99, Pittsburgh, Pennsylvania, United States, pp. 576 – 583. New York: ACM Press 
Pireddu, A. (2007). Multimodal Interaction: An integrated speech and gaze approach. Thesis submitted at 
Politecnico di Torino 
Van Dam, A. (2001). Post-Wimp user interfaces: The human connection. In R. Earnshaw, R. Guedj, A. van Dam and J. 
Vince (Eds), Frontiers of human-centred computing, online communities and virtual environments (pp. 163-178). 
London, Great Britain: Springer-Verlag 
Wobbrock, J.O., Rubinstein, J., Sawyer, M.W. & Duchowski, A.T. (2008). Longitudinal evaluation of discrete 
consecutive gaze gestures for text entry. In Proceedings of the 2008 Symposium on Eye Tracking Research and 
Applications (ETRA), Savannah, Georgia, United States of America, pp. 11-18 
 
  
 
 
SUMMARY 
 
Multimodal interfaces may herald a significant improvement on current GUIs which have been commonplace 
until now. It is also possible that a multimodal interface could provide a more intuitive and natural means of 
interaction which, simultaneously, negates the reliance on traditional, manual means of interaction. Eye gaze 
and speech are common components of natural human-human communication and were proposed for use in 
a multimodal interface for a popular word processor for the purposes of this study. 
In order for a combination of eye gaze and speech to be a viable interface for a word processor, it must 
provide a means of text entry and facilitate editing and formatting of the document contents. For the purposes 
of this study a simple speech grammar was used to activate common word processing tasks, as well as for 
selection of text and navigation through a document. For text entry, an onscreen keyboard was provided, the 
keys of which could be pressed by looking at the desired key and then uttering an acceptable verbal command. 
These functionalities were provided in an adapted Microsoft Word 2007® to increase the customisability and 
possibly the usability of the word processor interface and to provide alternative means of interaction. The 
proposed interaction techniques also had to be able to execute typical mouse actions, such as point-and-click. 
The usability of eye gaze and speech was determined using longitudinal user testing and a set of tasks specific 
to the functionality.  
Results indicated that the use of a gravitational well increased the usability of the speech and eye gaze 
combination when used for pointing-and-clicking. The use of a magnification tool did not increase the usability 
of the interaction technique. The gravitational well did, however, result in more incorrect clicks due to natural 
human behaviour and the ease of target acquisition afforded by the gravitational well. Nevertheless, participants 
learnt how to use the interaction technique over the course of time, although the mouse remained the 
superior pointing device. 
Speech commands were found to be as usable as, or even more usable than, the keyboard and mouse for editing 
and selection purposes, although navigation was hindered to some extent. For text entry purposes, the 
keyboard far surpasses eye gaze and speech in terms of performance as an input method as it is both faster 
and results in fewer errors than eye gaze and speech. However, even though the participants were required to 
complete a number of sessions and a number of text entry tasks per session, more practice may be required 
for using eye gaze and speech for text entry. Subjectively, participants felt comfortable with the multimodal 
interface and also indicated that they felt improvement as they progressed through their sessions. 
Observations of the participants also indicated that as time passed, the participants became more adept at 
using the multimodal interface for all necessary interactions. 
In conclusion, eye gaze and speech can be used instead of a pointing device and speech commands are 
recommended for use within a word processor in order to accomplish common tasks. For the purposes of text 
entry, more practice is advocated before a recommendation can be made. Together with progress in hardware 
development and availability, this multimodal interface may allow the word processor to further exploit 
emerging technologies and be a forerunner in the use of multimodal interfaces in other applications. 
 
Keywords: Multimodal interfaces, gaze-controlled interfaces, speech-controlled interfaces, eye-tracking, 
speech recognition, word processing, usability 
  
 
 
OPSOMMING 
 
Multi-modale koppelvlakke kan ’n betekenisvolle bydrae lewer tot grafiese gebruikerskoppelvlakke soos wat 
dit die afgelope tyd bekend was.  Dit is ook moontlik dat multi-modale koppelvlakke ’n meer intuïtiewe en 
natuurlike interaksie-medium kan bied om die afhanklikheid van tradisionele handbeheerde interaksie 
tegnieke te verminder.  Visie en spraak is alledaagse komponente van natuurlike mens-tot-mens 
kommunikasie en word in hierdie studie ook voorgestel vir gebruik in ’n multi-modale koppelvlak vir ’n gewilde 
woordverwerkingspakket. 
Om lewensvatbaar te wees in die koppelvlak van ’n woordverwerkingspakket, moet ’n kombinasie van visie en 
spraak die invoer van teks, redigering asook formatering van ’n dokument, fasiliteer.  Vir die doeleindes van 
hierdie studie is ’n beperkte stel mondelinge opdragte gebruik om alledaagse woordverwerkingsopdragte, 
sowel as die seleksie van teks en navigering in ’n dokument, te aktiveer.  Met die oog op teksinvoer is ’n 
visuele sleutelbord op die skerm vertoon.  ’n Sleutel kon geaktiveer word deur daarna te kyk en dan ’n gepaste 
opdrag te uiter.  Hierdie funksionaliteite is in ’n aangepaste Microsoft Word 2007® woordverwerkingspakket 
geïmplementeer om die aanpasbaarheid en moontlik ook die bruikbaarheid van die woordverwerkings-
koppelvlak te verhoog en om alternatiewe interaksietegnieke te voorsien.  Die voorgestelde interaksietegnieke 
moes ook geskik wees om tipiese muis-aksies, byvoorbeeld wys-en-kliek, uit te voer.  Die bruikbaarheid van 
visie en spraak is bepaal deur longitudinale gebruikerstoetsing en ’n stel take wat op spesifieke funksionaliteite 
betrekking het. 
Die resultate het aangedui dat die gebruik van ’n gravitasieput die bruikbaarheid van die kombinasie van 
spraak en visie tydens wys-en-kliek aksies verhoog het.  Die gebruik van ’n vergrotingspakket het nie die 
bruikbaarheid van die interaksietegniek verhoog nie.  Natuurlike menslike gedrag en die gemak waarmee 
teikens gekliek kon word deur gebruik van ’n gravitasieput, het egter veroorsaak dat die gravitasieput meer 
foutiewe klieks tot gevolg gehad het.  Deelnemers het egter mettertyd geleer om die tegniek te gebruik, 
alhoewel die muis steeds die beste wysertoestel gebly het. 
Dit is verder bevind dat mondelinge opdragte net so goed of selfs beter is vir redigering en seleksie as die 
sleutelbord en muis, alhoewel navigering in ’n mate gekortwiek is.  Die sleutelbord is verreweg die beste 
tegniek om teks in te voer aangesien dit vinniger was en deelnemers ook minder foute gemaak het as met 
spraak en visie.  Alhoewel deelnemers ’n aantal take uitgevoer het tydens ’n hele paar sessies, mag meer 
oefening nodig wees om spraak en visie vir teksinvoer te gebruik.  Subjektiewe terugvoer van deelnemers het 
aangedui dat hulle gemaklik was met die multi-modale koppelvlak en dat hulle ervaar het dat hulle van een 
sessie tot die volgende verbeter het.  Dit is ook waargeneem dat deelnemers meer bedrewe geraak het met 
oefening en die multi-modale koppelvlak mettertyd vir al die nodige interaksies kon gebruik. 
Ter opsomming is dit duidelik dat spraak en visie gebruik kan word in die plek van ’n wysertoestel en dit word 
aanbeveel dat mondelinge opdragte gebruik word om alledaagse woordverwerkingstake uit te voer.  Dit is 
nodig dat deelnemers meer oefening in teksinvoer moet kry voordat ’n aanbeveling gemaak kan word.  Hierdie 
multi-modale koppelvlak kan, in samehang met die ontwikkeling en beskikbaarheid van apparatuur, die 
woordverwerker toelaat om nuwe tegnologieë te ontgin en die weg te baan vir gebruik van multi-modale 
koppelvlakke in ander toepassings. 
 
 