Multimodal Human Computer Interaction: A Survey

  • Alejandro Jaimes
  • Nicu Sebe
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3766)


In this paper we review the major approaches to multimodal human computer interaction from a computer vision perspective. In particular, we focus on body, gesture, gaze, and affective interaction (facial expression recognition and emotion in audio). We discuss user and task modeling and multimodal fusion, highlighting challenges, open issues, and emerging applications for Multimodal Human Computer Interaction (MMHCI) research.


Keywords: Facial Expression, Emotion Recognition, Gesture Recognition, Facial Expression Recognition, Dynamic Bayesian Network
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Alejandro Jaimes (1)
  • Nicu Sebe (2)
  1. FXPAL, Fuji Xerox Co., Ltd, Japan
  2. University of Amsterdam, The Netherlands
