
Are You Looking At This? A Closer Look at Eye-Contact Human-Machine Interaction Technology

Human-machine interaction (HMI) is described as the interaction and communication between human users and a machine via a user interface [1, 2]. User interfaces can take many different forms, ranging from a simple graphical user interface to a voice-based user interface. However, this article will focus solely on gesture-based user interfaces. As the name indicates, gesture-based user interfaces allow human users to control machines through natural and intuitive behaviours [1]. One such example is the Microsoft Kinect, which captures human motions that are then processed to control a machine as desired [1]. More information on the Microsoft Kinect can be found in [3]. The aim of any gesture-based HMI is to recognise the meaningful expressions behind the human motions captured by the gesture-based user interface [1].


Why utilise eye contact in human-machine interaction applications?

Kleinke [4] and Argyle et al. [5] concluded that eye contact is a non-verbal social cue that conveys engagement and interest toward a target. Consequently, a machine capable of detecting direct one-way eye contact, a phenomenon known as gaze locking, lends itself to very powerful HMI [6]. One application is to support persons with disabilities in performing daily activities, by providing them with an assistive tool that facilitates their interaction with devices in their environment. It is equally well suited for integration into smart homes, improving accessibility and user experience all around.


A brief background on one-way eye contact detection methods


In the one-way eye contact detection literature, two classes of methods are considered standard, namely active one-way eye contact detection techniques and passive one-way eye contact detection techniques. Active techniques require direct interaction with the gazer for eye contact detection, thus necessitating special illumination, such as infrared (IR) [6]. In contrast, passive techniques typically rely on computer vision methods, detecting one-way eye contact by analysing video and image data, thus eliminating the need for special illumination or hardware.
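To make the passive approach more concrete, the sketch below uses OpenCV's bundled Haar cascades to locate a face and candidate eye regions in a single frame. It is a minimal illustration of the kind of off-the-shelf computer vision front end a passive pipeline might build on, not the method discussed later in this article; the input image name is an assumption.

```python
# Minimal sketch of a passive (vision-only) front end: locate a face and
# candidate eye regions in one image using OpenCV's bundled Haar cascades.
# "frame.jpg" is a placeholder; a real pipeline would follow this step with
# gaze or eye-contact estimation on the cropped eye regions.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

frame = cv2.imread("frame.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                                  minNeighbors=5):
    face_roi = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(face_roi)
    print(f"Face at ({x}, {y}); {len(eyes)} candidate eye region(s) found")
```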


Up until the early 2010s, active methods were the established norm for one-way eye contact detection due to their superior performance. However, active methods typically relied on IR illumination, which rendered them quite impractical in real-world scenarios. On the other hand, passive one-way eye contact detection required tedious feature engineering, that is, the process of manually designing or selecting distinguishable features.


In the early 2010s, deep neural networks (NNs) surged in popularity. As a result, it was not long until NNs were also applied to one-way eye contact detection tasks. In its simplest form, an NN can be described as a black box which outputs a label based on some given input features. NNs allow the automatic discovery of complex features that are specifically tailored to the task itself, thus lifting the feature engineering dependency of passive one-way eye contact detection techniques. This in turn led active one-way eye contact detection techniques to become nearly obsolete.
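For readers unfamiliar with this "black box" view, the sketch below shows what such a classifier might look like in PyTorch: a small network mapping an input feature vector to a binary eye-contact label. The layer sizes and feature dimension are purely illustrative and are not those used in this work.

```python
# Illustrative only: a tiny feed-forward network that maps an input feature
# vector to a binary "eye contact" / "no eye contact" label. The dimensions
# are arbitrary and not those used in the work described here.
import torch
import torch.nn as nn

class EyeContactNet(nn.Module):
    def __init__(self, in_features: int = 128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, 64),
            nn.ReLU(),
            nn.Linear(64, 2),  # two classes: eye contact vs. no eye contact
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

model = EyeContactNet()
dummy_input = torch.randn(1, 128)               # one sample with 128 features
predicted_label = model(dummy_input).argmax(dim=1)
print(predicted_label.item())                   # 0 or 1
```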


Is there a common limitation shared amongst one-way eye contact detection techniques?

All one-way eye contact detection techniques share a common limitation: the reliance on target-centric cameras, where every device to be activated requires a dedicated camera mounted in its proximity. This limitation restricts the scalability of these techniques in scenarios involving multiple targets. Therefore, a gap exists for a one-way eye contact detection technique that harnesses the advancements in passive techniques whilst being able to detect one-way eye contact with multiple targets, without requiring additional equipment or hardware, for HMI applications.


Proposed idea

In the domain of HMI, human-to-device eye contact can be exploited to interact with and control a device from a distance. However, a recurring constraint within the majority of prior art on general passive one-way eye contact detection is that of a target-centric camera, where every device requires its own locally mounted camera(s). Naturally, this prevents such one-way eye contact detection techniques from scaling to multiple devices.


As a result, this work explored the detection of human-to-device eye contact with two different target devices from a third-view camera setup, that is, a single off-target camera, as depicted in the figure below. By employing a third-view camera setup, this work aimed to perform human-to-device eye contact detection with an arbitrary number of devices, independent of the user.


More specifically, this work aimed to differentiate between instances of eye contact between any user and both devices at four unique locations, represented by QA, QB, QC, and QD respectively, as shown in Figure 1. An inductive transfer learning strategy was adopted throughout this work, whereby a pre-trained NN was utilised to bias previously learnt knowledge towards automatically learning new, unique, and diverse features for the aforementioned task.
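The sketch below illustrates one common way such an inductive transfer learning strategy is implemented in PyTorch: a backbone pre-trained on ImageNet is reused, its weights are frozen, and only a new classification head is trained on the eye-contact data. The choice of ResNet-18 and the three output classes are assumptions made for illustration, not the configuration used in the dissertation.

```python
# Sketch of an inductive transfer learning setup: reuse a pre-trained image
# backbone and train only a new task-specific head. ResNet-18 and the three
# output classes (device A, device B, no eye contact) are illustrative
# assumptions, not the actual configuration used in this work.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 3

# Load a backbone pre-trained on ImageNet.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained weights so that previously learnt features are reused.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new task-specific head.
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head's parameters are optimised on the curated dataset.
optimiser = torch.optim.Adam(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-3)
```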



Figure 1: Proposed camera and device layout. QA, QB, QC and QD are four quadrants where the user is expected to be located.


Development of a one-way eye contact technique

This work was effectively divided into two separate tasks. The first task comprised curating a specialised dataset tailored to the aforementioned task, due to the absence of readily available datasets. The second task decomposed the one-way eye contact detection problem into two separate stages, namely user-dependent learning and user-independent learning. The user-dependent learning stage assessed one-way eye contact detection abilities per distinct user, per specific location in the immediate environment. Once user-dependent learning had succeeded, the learning task was scaled up to assess one-way eye contact detection abilities independent of the user.
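The difference between the two stages is essentially a matter of how the data are split. The sketch below contrasts the two protocols, assuming each sample is stored as a dictionary carrying its features, class label, and a user identifier; the split logic is generic rather than tied to the dataset curated in this work.

```python
# Sketch of the two evaluation protocols, assuming each sample is a dict with
# "features", "label" and "user_id" keys (an assumed representation).
import random

def user_dependent_split(samples, user_id, test_fraction=0.2):
    """User-dependent: train and test on the same user's own data."""
    own = [s for s in samples if s["user_id"] == user_id]
    random.shuffle(own)
    cut = int(len(own) * (1 - test_fraction))
    return own[:cut], own[cut:]

def user_independent_split(samples, held_out_user):
    """User-independent (leave-one-user-out): the held-out user never
    appears in the training data."""
    train = [s for s in samples if s["user_id"] != held_out_user]
    test = [s for s in samples if s["user_id"] == held_out_user]
    return train, test
```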


The results suggest that the user-dependent learning stage was successful, since one-way eye contact instances per user, per specific location, were successfully differentiated. In contrast, the user-independent learning stage was not deemed a success, which implies a lack of generalisation across users. This was attributed to the limited size and lack of diversity of the curated dataset. However, upon including a small subset of the target user's data in the training set, the one-way eye contact detection performance improved drastically. This approach bears a resemblance to few-shot learning techniques, where models learn from a small set of examples to adapt to new tasks.
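The adaptation step can be pictured as moving a handful of the otherwise held-out user's samples into the training set before fine-tuning and then evaluating on the rest of that user's data. The sketch below outlines this idea; the per-class sample count and the dictionary representation are illustrative assumptions.

```python
# Sketch of the few-shot-style adaptation: a few of the held-out user's
# samples per class are moved into the training set before fine-tuning.
# The representation (dicts with "label" and "user_id") and the number of
# shots are illustrative assumptions.
import random

def adaptation_split(samples, held_out_user, shots_per_class=5):
    """Leave-one-user-out split, except that a small per-class subset of the
    held-out user's data is added to the training set."""
    train = [s for s in samples if s["user_id"] != held_out_user]
    held_out = [s for s in samples if s["user_id"] == held_out_user]
    random.shuffle(held_out)

    adaptation, test, counts = [], [], {}
    for s in held_out:
        if counts.get(s["label"], 0) < shots_per_class:
            adaptation.append(s)
            counts[s["label"]] = counts.get(s["label"], 0) + 1
        else:
            test.append(s)
    return train + adaptation, test
```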


Further developments

This work investigated one-way eye contact detection from a third-view camera setup, with the aim of generalising across users. Building upon this foundation, future research can focus on employing the same approach to generalise across user locations, and subsequently scaling it up further to generalise across both users and user locations, thus making it more applicable to real-world scenarios.


A notable improvement would be the expansion of the developed dataset, which can further enrich the extracted features and make them more robust to variation. In addition, one may also investigate the same problem using video data rather than still images. This would provide temporal as well as further contextual information, resulting in richer features which may lead to better generalisation.


In addition, it would be worth applying deeper and more complex pre-trained NNs, which have a larger capacity to capture intricate patterns and relationships within the input data and can therefore potentially lead to better performance and generalisation.


References

  1. Q. Ke, J. Liu, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, “Chapter 5 - Computer Vision for Human–Machine Interaction,” ScienceDirect, Jan. 01, 2018. Available: https://shorturl.at/dyU27

  2. H. D. Unbehauen, "Control Systems Robotics and Automation – Volume XXI: Elements of Automation". EOLSS Publications, 2009. Accessed: Nov. 05, 2023. Available: https://shorturl.at/cmrvI

  3. Wikipedia Contributors, “Kinect,” Wikipedia, May 07, 2019. https://en.wikipedia.org/wiki/Kinect

  4. C. L. Kleinke, “Gaze and eye contact: A research review,” Psychological Bulletin, vol. 100, no. 1, pp. 78–100, 1986.


 

Aiden Bezzina completed the MSc in Signals, Systems and Control in 2023. The work described in this article reflects the work carried out in the project dissertation and was supervised by Dr Stefania Cristina and Prof. Kenneth Camilleri.


