Audio-visual perception for the recognition of human activities

For more than eight years, our research has addressed the detection and recognition of people, as well as the interpretation of human motions, from visual and auditory sensors.
Our work in visual perception of humans has dealt with the tracking and interpretation of human movements. We have prototyped and evaluated functions that are robust to environmental conditions and computationally inexpensive.
Within our newly launched research in robot audition, an auditory sensor has been built for the generation of “acoustic maps” of the environment and the extraction of sources; it meets the specific constraints raised by robotics.

Visual perception of humans (PhDs: L. Brethes, 2006; M. Fontmarty, 2008)

For the tracking and interpretation of human motions, we rely on probabilistic graphical models and on particle filtering, chosen for its ability to integrate diverse sensory percepts easily and rigorously. Several trackers were prototyped and evaluated, based on the selection of visual cues and on the probabilistic combination of person detection/identification modules within advanced filtering strategies. This approach has been applied to appearance-based 2D person tracking [MVA journal, 2008], as well as to the 3D tracking of the whole body or of body parts from embedded or fixed cameras [ICPR2008]. Further investigations have addressed the multimodal fusion of vision and speech [IROS2008] for the interpretation of speech commands parameterized by gestures, e.g. deictic gestures. Implementations on mobile platforms dedicated to interaction have enabled the validation of most of our visual functions in robotic scenarios and have underlined their complementarity.
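As a minimal, hedged sketch of the kind of particle filter used for such tracking, the code below fuses several visual cues (e.g. color and motion likelihoods) in the importance weights. The cue functions and the random-walk motion model are illustrative placeholders, not the trackers evaluated in [MVA journal, 2008] or [ICPR2008].

    import numpy as np

    def particle_filter_step(particles, weights, observation, cue_likelihoods,
                             motion_std=5.0, rng=np.random.default_rng(0)):
        """One predict/update/resample cycle over 2D image positions.

        particles       : (N, 2) array of (x, y) hypotheses
        weights         : (N,) normalized importance weights
        cue_likelihoods : list of functions f(observation, particles) -> (N,) likelihoods
        """
        n = len(particles)
        # Predict: random-walk dynamics on the image plane (placeholder motion model).
        particles = particles + rng.normal(0.0, motion_std, particles.shape)
        # Update: multiply the likelihoods of the (assumed independent) visual cues.
        likelihood = np.ones(n)
        for cue in cue_likelihoods:
            likelihood *= cue(observation, particles)
        weights = weights * likelihood
        weights = weights / (weights.sum() + 1e-12)
        # Resample when the effective sample size collapses below N/2.
        if 1.0 / np.sum(weights ** 2) < n / 2:
            idx = rng.choice(n, size=n, p=weights)
            particles, weights = particles[idx], np.full(n, 1.0 / n)
        return particles, weights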

The figure below illustrates the interpretation of multimodal commands by the robot (left: hands and head tracking; center: synthetic top view; right: experiment). Three modules have been integrated on our JIDO robot: RECO (speech), GEST (vision), and MHP (manipulation involving humans).
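As a rough illustration of how a verbal command can be parameterized by a deictic gesture, the sketch below resolves “that object” to the known object lying closest to the pointing ray estimated by the gesture tracker. The data structures and the function name are hypothetical and do not correspond to the RECO/GEST/MHP interfaces.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class SceneObject:
        name: str
        position: np.ndarray   # (3,) position in the robot frame

    def resolve_deictic(objects, hand_pos, pointing_dir):
        """Return the object whose center is closest to the pointing ray."""
        d = pointing_dir / np.linalg.norm(pointing_dir)
        def dist_to_ray(p):
            v = p - hand_pos
            along = max(float(np.dot(v, d)), 0.0)     # ignore objects behind the hand
            return np.linalg.norm(v - along * d)      # orthogonal distance to the ray
        return min(objects, key=lambda o: dist_to_ray(o.position))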



Robot audition: sound source localization (PhD: S. Argentieri, 2006)

A fully programmable integrated auditory sensor has been developed, based on a linear array of 8 microphones, a dedicated acquisition chain, and FPGA-based processing. The first aim was to compute acoustic maps of the environment at a rate of 15 Hz and to spatially filter sources out of the ambient noise [IROS2009]. The constraints raised by robotics are: embeddability (size, energy), real time, wideband sources (e.g. at least the [300 Hz; 3000 Hz] band for voice), far-field or near-field assumptions, as well as ambient noise and reverberation. Following a localization strategy in azimuth and distance based on an original broadband far-field/near-field beamforming algorithm [IROS2006], a second approach was proposed, combining this algorithm with a broadband extension of the high-resolution MUSIC (MUltiple SIgnal Classification) method [IROS2007].
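To make the pseudospectrum idea concrete, here is a minimal narrowband MUSIC sketch over (range, azimuth) for a linear 8-microphone array. It is only an illustration under simplified assumptions (single frequency bin, free field, placeholder array geometry), not the broadband extension of [IROS2007].

    import numpy as np

    C = 343.0                                 # speed of sound (m/s)
    MICS = np.linspace(-0.315, 0.315, 8)      # assumed microphone x-coordinates (m)

    def steering_vector(freq, azimuth, source_range):
        """Near-field steering vector for a source at (range, azimuth from broadside)."""
        dists = np.sqrt(source_range**2 - 2.0 * source_range * MICS * np.sin(azimuth) + MICS**2)
        return np.exp(-2j * np.pi * freq * dists / C)

    def music_pseudospectrum(snapshots, freq, n_sources, azimuths, ranges):
        """snapshots: (n_mics, n_snapshots) complex STFT bins at frequency `freq`."""
        R = snapshots @ snapshots.conj().T / snapshots.shape[1]   # sample covariance
        _, vecs = np.linalg.eigh(R)                    # eigenvalues in ascending order
        noise = vecs[:, :len(MICS) - n_sources]        # noise-subspace eigenvectors
        P = np.empty((len(ranges), len(azimuths)))
        for i, r in enumerate(ranges):
            for j, az in enumerate(azimuths):
                a = steering_vector(freq, az, r)
                P[i, j] = 1.0 / np.real(a.conj() @ noise @ noise.conj().T @ a)
        return P                                       # sharp peaks at the source locations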

The figure below shows, on the left, the acoustic sensor and, on the right, a typical MUSIC pseudospectrum as a function of the source (range, azimuth), with sharp peaks at the source locations (here two sources: a voice and incidental music).

Ongoing PhDs

  • B. Burger, on gesture recognition
  • I. Zuriarrain (Mondragon University, Spain), on visual monitoring
  • T. Germa, on human detection, tracking and identification from embedded sensors
  • W. Filali, on integrated sensors for visual monitoring
  • J. Bonnal, on the acoustic sensor