Eye Gaze Estimation: A Survey on Deep Learning-Based Approaches

Primesh Pathirana a, Shashimal Senarath b, Dulani Meedeniya c and Sampath Jayarathna d,∗

a Department of Computer Science and Engineering, University of Moratuwa, Moratuwa, 10400, Sri Lanka
b Department of Computer Science and Engineering, University of Moratuwa, Moratuwa, 10400, Sri Lanka
c Department of Computer Science and Engineering, University of Moratuwa, Moratuwa, 10400, Sri Lanka
d Department of Computer Science, College of Science, Old Dominion University, Norfolk 23529, USA

∗ Corresponding author: primeshs.17@cse.mrt.ac.lk (P. Pathirana); shashimalsenarath.17@cse.mrt.ac.lk (S. Senarath); dulanim@cse.mrt.ac.lk (D. Meedeniya); sampath@cs.odu.edu (S. Jayarathna)

ABSTRACT
Human gaze estimation plays a major role in many applications in human-computer interaction and computer vision by identifying the users' point of interest. The revolutionary developments of deep learning have captured significant attention in the gaze estimation literature. Gaze estimation techniques have progressed from single-user constrained environments to multi-user unconstrained environments, driven by the applicability of deep learning techniques in complex unconstrained environments with extensive variations. This paper presents a comprehensive survey of single-user and multi-user gaze estimation approaches with deep learning. The state-of-the-art approaches are analyzed based on deep learning model architectures, coordinate systems, environmental constraints, datasets and performance evaluation metrics. A key outcome of this survey is the identification of the limitations, challenges, and future directions of multi-user gaze estimation techniques. Furthermore, this paper serves as a reference point and a guideline for future multi-user gaze estimation research.

1. Introduction
Eye gaze plays an important role in identifying the users' point of interest in terms of direction and location, attention, emotions and interactions. Generally, human gaze estimation is a frequently used approach to gain a better understanding of human cognition and behaviour. Many studies have addressed approaches to trace the position and direction of eye gaze, which is required for different domains such as cognitive studies (Chong, Wang, Ruiz & Rehg, 2020), social behaviour (Kodama, Kawanishi, Hirayama, Deguchi, Ide, Murase, Nagano & Kashino, 2018; Sugano, Zhang & Bulling, 2016), medical health (De Silva, Dayarathna, Ariyarathne, Meedeniya, Jayarathna & Michalek, 2021; De Silva, Dayarathna, Ariyarathne, Meedeniya, Jayarathna, Michalek & Jayawardena, 2019), commercial (Bermejo, Chatzopoulos & Hui, 2020; Sugano et al., 2016)
and other human-computer interaction applications (Zhang, Sugano, Fritz & Bulling, 2015). Additionally, gaze estimation environments can be classified as constrained/controlled or unconstrained/wild. Constrained environments are those that have a fixed set of parameters such as illumination, subject count, and head-angle variation. On the other hand, unconstrained environments are those with a considerable measure of parameter variation. It is clear that, with the widespread use of gaze estimation technology across many application domains, gaze estimation has progressed more into unconstrained environments, surpassing constrained environment settings. Although several eye gaze estimation solutions are available, some of them suffer from drawbacks such as high cost, the need for manual intervention, unreliability, and inaccuracy in practical deployments. Also, the performance of some traditional approaches is limited by factors such as low image quality and lighting conditions. In such scenarios, Deep Learning (DL) based eye gaze estimation approaches come into play due to inherited benefits such as learning from existing data, automation, a flexible process, high accuracies, and better decision making. These prevalent DL-based approaches have shown success in performance improvements in eye gaze applications. Human gaze estimation approaches fall into two broad categories: model-based techniques and appearance-based techniques. Model-based methods fundamentally require dedicated devices such as near-infrared (NIR) cameras to manually regress the eye features and build a geometric model (Cheng, Wang, Bao & Lu, 2021; Kar & Corcoran, 2017). This method is person-specific and restricted to constrained environments (Cheng et al., 2021; Akinyelu & Blignaut, 2020). In comparison, appearance-based techniques do not necessitate dedicated devices and are not limited to constrained environments. These methods can be subdivided into two categories, namely conventional appearance-based methods and appearance-based methods with deep learning. Over the last decade, eye-tracking literature has seen a surge of interest in gaze estimation methods based on deep learning techniques due to their applicability and robustness in unconstrained environments. In contrast to conventional appearance-based methods, deep learning-based methods exhibit many benefits, such as the ability to extract high-level gaze features from images and the ability to learn a non-linear mapping function directly from the image to eye gaze (Cheng et al., 2021; Kellnhofer, Recasens, Stent, Matusik & Torralba, 2019). Deep convolutional neural networks (DCNN) have been utilized in almost every deep learning-based gaze estimation approach due to their ability to map image features directly, handle large-scale datasets, and learn complex non-linear mappings when faced with significant head-pose variations, eye occlusions, and illumination conditions. Appearance-based methods with deep learning, which are the main focus of this study, can be divided further into two subcategories based on the number of subjects, namely single-user gaze estimation and multi-user gaze estimation.
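To make this direct image-to-gaze mapping concrete, the following minimal sketch shows an appearance-based gaze regressor in PyTorch. It is our illustration rather than a model from any surveyed paper; the input size, layer widths, and training settings are assumptions chosen only for demonstration.

```python
# Minimal appearance-based gaze estimator (illustrative sketch only):
# a small CNN that maps a normalized 36x60 grayscale eye patch directly
# to a 2D gaze direction (pitch, yaw). All sizes are assumed values.
import torch
import torch.nn as nn

class EyeGazeCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 6 * 12, 500), nn.ReLU(),
            nn.Linear(500, 2),  # (pitch, yaw) in radians
        )

    def forward(self, eye_patch):
        return self.regressor(self.features(eye_patch))

# One training step on a dummy batch of eye patches and gaze labels.
model = EyeGazeCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
eye_patches = torch.randn(8, 1, 36, 60)   # batch of eye images
gaze_labels = torch.randn(8, 2)           # ground-truth (pitch, yaw)
loss = nn.functional.l1_loss(model(eye_patches), gaze_labels)
loss.backward()
optimizer.step()
```

Single-user methods of this kind regress one gaze per detected eye or face crop; the multi-user methods discussed later apply such predictors to every person detected in the scene.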
Despite the significant shift in gaze estimation techniques towards applications in unconstrained environments, the demand for multi-user gaze estimation approaches is on the rise. As of the year 2021, a limited number of such methods have been researched by time-shifting and space-shifting single-user gaze estimation. This survey paper explores state-of-the-art methods and techniques used in eye gaze estimation research. We analyse the use of the latest deep learning (DL) techniques, useful public datasets and different approaches used by related studies. The lessons learned from this survey states that eye gaze applications are evolving with the use of DL techniques due to the inherited benefits. Moreover, this study suggests guidance to follow a DL based process for eye gaze estimation that can be used as a reference. Further, we discuss the challenges and future research directions in eye gaze estimation in several applications. Thus, we aim to inspire the researchers and developers with useful insights to produce effective and efficient eye gaze estimation applications using DL techniques. Figure 1 states the survey structure considered for this article, focusing on the single and multi-user gaze estimation methods in deep learning. Section 1 states the survey motivation and main contributions of this research. Section 2 explains the scope of the survey and discusses the background of current related studies. Section 3 provides an overview of multi-user gaze estimation by discussing the history, progression, and applications of gaze estimation. Section 4 broadly discusses the technical aspects of existing gaze estimation approaches, focusing on appearance-based methods with deep learning. In Section 5, Section 6, and Section 7, we present the supplementary knowledge in gaze estimation literature by stating the theoretical concepts behind the coordinate systems, describing different gaze datasets, and stating the state-of-the-art performance evaluation metrics. Section 8 elaborates and critically analyzes the existing single-user and multi-user gaze estimation approaches by summarizing their key outcomes and limitations. Section 9 suggests guidance to select a given approach based on different conditions and discusses the limitations, challenges, future direction in multi-user gaze estimation literature. Finally, Section 10 concludes the study. 2. Background 2.1. Related Work Among many studies that have focused on eye gaze estimation research, only a few survey studies are available that discusses growing aspects in the literature that is focused on DL techniques. Table 1 summarizes the features addressed by the existing related survey papers. Some of the studies have discussed different gaze estimation approaches such as model-based methods, appearance-based methods, deep learning-based methods and convolutional neural network (CNN) based methods. For instance, Kar & Corcoran (2017) have explored these methods focusing on model-based approaches. They have presented their work under five categories, 1) 2D regression, 2) 3Dmodel, 3) Appearance-based, 4) Cross Ratio-based, and 5) Shape based methods. Similarly, Cazzato, Leo, Distante & Voos (2020) have surveyed gaze estimation techniques under two categories, 1) Geometric-based, and 2) Appearance-based methods, by analyzing the advancements in computer vision such as deep learning. In another point of view Akinyelu & Blignaut (2020) and Cheng et al. 
(2021) have shown different deep learning-based gaze estimation techniques focusing on CNNs. Many of these studies have further reviewed the calibration techniques, performance evaluation metrics, devices and platforms, and datasets in the gaze estimation literature. However, most of the studies have not discussed these approaches from a multi-user gaze estimation perspective considering factors such as unconstrained environmental settings, gaze target variations, and coordinate systems.

Figure 1: Structure of the paper.

Table 1: Summary of related survey papers (surveys compared: Kar 2017, Cheng 2021, Akinyelu 2020, Cazzato 2020, Klaib 2021)
Model-based method: ✓ ✓
Appearance-based methods: ✓ ✓ ✓ ✓ ✓
DL-based methods: ✓ ✓ ✓ ✓ ✓
Calibration techniques: ✓ ✓
Datasets: ✓ ✓ ✓
Performance evaluation metrics: ✓ ✓ ✓ ✓
Devices and platforms: ✓ ✓ ✓ ✓

2.2. Scope of the Survey
This paper provides a comprehensive survey of single- and multi-user gaze estimation methods in deep learning from 2015 to 2021. The related studies are surveyed from four perspectives: 1) deep neural network model architecture, 2) datasets, 3) environment, and 4) performance evaluation. From the deep neural network model architecture perspective, we review the deep learning-based approaches such as multi-task CNNs, temporal and spatial CNNs, and capsule networks. Network backbones, inputs and outputs, and optimization techniques are further discussed. From the datasets perspective, metadata such as the number of images, subject variations, annotation formats, and image quality are discussed. The environment perspective describes the coordinate systems used, head-pose variations, illumination variations, and other application-specific environmental parameters. Finally, we review and compare the acquired performance aspects. Following are the highlights of the survey paper.
• Present an in-depth analysis of the deep learning-based gaze estimation approaches from 2015 to 2021 with a focus on multi-user gaze estimation techniques in unconstrained settings.
• Provide a survey on existing state-of-the-art single-user and multi-user large-scale gaze datasets. Requirements for a standard multi-user gaze dataset, a summary of public and synthetic gaze datasets, and issues related to public gaze datasets are discussed and analyzed.
• Explain the theory behind coordinate systems and the possible performance evaluation metrics that can be applied to eye gaze estimation.
• Suggest guidance for selecting deep learning-based approaches in eye gaze estimation for researchers and developers. Discuss the open challenges and future opportunities in the field of deep learning-based multi-user gaze estimation.

2.3. Evolution of the techniques
Figure 2 shows a quantitative view of the use of the techniques in the related literature between 2015 and 2020. We have considered the research papers indexed in Google Scholar for each of the techniques in the related studies. Our search strategy is based on "" + "". Although the considered data can vary slightly due to the search query's associated noise, we assume the flaws are equally distributed over the search results for all the considered techniques. Thus, the audience can get a comparative view of the usage of the main techniques in this area.
As shown in Figure 2, there is a similar growth in AlexNet, VGG (Visual Geometry Group), and Inception techniques in the year 2016 to 2020. In another point of view, the residual neural network (ResNet) technique has shown a rapid increase in popularity. However, LeNet has decreased its usage, which may be due to the recent advancements in Residual networks. Overall, it can be seen that the interest in gaze estimation research with deep CNNs is steadily increasing irrespective of the type of technique. Figure 2: Evolution of deep Learning-based gaze estimation techniques. First Author et al.: Preprint submitted to Elsevier Page 5 of 29 Short Title of the Article 3. Overview of Multi-User Gaze Estimation 3.1. Eye Gaze Human eye gaze is an active natural form of interaction that gathers information from a visual scene. It provides a wealth of information about human actions even though eye gaze is subtle and straightforward in comparison to gesture and speech. In eye gaze research, eye movements are studied thoroughly based on their type, functionality, and characteristics. Analysis of eye movements are used to gather data about the user’s intention, cognitive activities, and attention (Velichkovsky, Rumyantsev &Morozov, 2014; Goldberg & Kotval, 1999; De Silva et al., 2021). These eye movements are broadly classified as fixations, saccades, smooth pursuit, scanpath, gaze duration, blink, and pupil size change (Kar & Corcoran, 2017). Fixations are times when eyes are stationary between movements and scan a scene. They have the least movement rate and are helpful for scanning detailed information, reading, and attention. Saccades, on the other hand, have the highest movement rates and are helpful for visual search. These are simultaneous movements of both eyes that occur between fixations. Smooth pursuits are eye-tracking movements used to follow moving targets of interest. Scanpath is a combination of alternating eye fixations and saccades prior to the eyes reach a target position. The dimensionality of eye gaze can be classified as 2D gaze and 3D gaze. 2D eye gaze can be calculated using gaze direction from a single eye, while calculating the 3D eye gaze requires both gaze direction and gaze depth from both eyes (Kwon, Jeon, Ki, Shahab, Jo & Kim, 2006). 3.2. Gaze Estimation Gaze estimation is an umbrella term used to assess human intent and interest through the measurement of human eye gaze (Tsukada, Shino, Devyver & Kanade, 2011). The history of human gaze estimation and eye-tracking dates back to the 18th century where researchers used invasive eye-tracking techniques to observe eye movements (Khan & Lee, 2019; Kar & Corcoran, 2017). However, with the developments in digital signal processing and computer vision fields, more and more non-invasive gaze estimation approaches have been adopted by utilizing unique, physical characteristics of the eye (Khan&Lee, 2019; Chennamma&Yuan, 2013; Kar &Corcoran, 2017). The photometric and motion characteristics of the human eye have provided essential features required for this task (Akinyelu & Blignaut, 2020; Khan & Lee, 2019). Gaze direction and point of gaze are two metrics used for gaze estimation. The visual axis, which deviates from the optical axis, determines the gaze direction (Kar & Corcoran, 2017) as shown in the Figure 3. Eye properties such as pupil and corneal reflection derived from eye regions, which are used to determine it in the application level (Chennamma & Yuan, 2013). 
Subsequently, gaze point is defined as the intersection of the of gaze direction and the object’s surface (Sun, Sun, Guo, Jia & Sun, 2016). Figure 3: Model of a human eye ball. Before the emerge of computer vision-based methods, gaze estimation techniques relied on detecting patterns of eye movement such as fixations, saccades, and smooth pursuits (Young & Sheena, 1975). Methods based on computer vision can be classified into three groups, (1) 2D eye feature regression methods, (2) 3D eye model recovery method, and (3) Appearance-based methods (Cheng et al., 2021). These methods estimate the gaze using eye image and video data and the eye’s geometric model characteristics. Specifically, the first two approaches detect geometric features of the eye such as corneal reflection and pupil center and build an eye model to estimate gaze (Cheng et al., 2021; Kar & First Author et al.: Preprint submitted to Elsevier Page 6 of 29 Short Title of the Article Corcoran, 2017). Coherently these two approaches are referred to as model-based approaches in the literature. The third strategy makes use of the eye’s photometric appearance to estimate gaze (Chennamma & Yuan, 2013). Model-based methods require the assistance of dedicated devices such as infrared cameras, while methods based on appearance do not require such specialized instruments for gaze measurement. Generally, there are two types of devices used in these methods: (1) remote eye tracker and (2) head-mounted eye tracker where the first type is typically kept at a distance of 60cm from the user and the cameras on the second type are commonly installed on a frame of glass (Cheng et al., 2021). The user interfaces for gaze estimation are categorized into four groups as active, passive, single, or multi-modal (Špakov & Miniotas, 2005; Sibert & Jacob, 2000; Kumar, Paepcke & Winograd, 2007b). Active interfaces utilize the user’s gaze to activate a function, while passive interfaces uses gathered gaze data to determine a user’s level of interest or attention (Kar & Corcoran, 2017). Depending on the coordinate system used, gaze estimation techniques can be divided into 2D gaze estimation and 3D gaze estimation. The vast majority of work has been proposed for 2D gaze estimation, while a few studies have focused on 3D gaze for accurate gaze estimation in real-world settings (Sugano et al., 2016; Kodama et al., 2018). 3.3. Multi-User Gaze Estimation With the rapid utilization of deep learning-based approaches in gaze estimation techniques in the last decade, a growing interest in gaze estimation in unconstrained environments has been noticed. The concept of multi-user gaze estimation has been studied and applied in various application domains due to this adaptation (Kodama et al., 2018; Kellnhofer et al., 2019; Bermejo et al., 2020). In contrast to conventional single-user gaze estimation, multi-user gaze estimation is mostly required in open environmental settings such as retail, public gatherings, and public venues. Hence, it requires robust, low-overhead, and high-speed gaze estimation approaches. Existing multi-user gaze estimation studies can be split into two categories as time-sharing approaches and space- sharing approaches (Kodama et al., 2018; Sugano et al., 2016). The time-sharing method distributes the number of users over a time period. On the other hand, the space sharing approach process multiple users at the same time. 
In literature, time-shifting approaches have not captured much attention due to their unscalability, and fewer robustness (Park, Jain & Sheikh, 2012; Park & Shi, 2015). 3.4. Applications of Gaze Estimation Gaze estimation is becoming an increasingly effective technique in a variety of fields such as computer vision, medical diagnosis, autonomous vehicles, psychology, human-computer interaction, and sports training (Kerr-Gaffney, Harrison & Tchanturia, 2019; Raptis, Katsini, Belk, Fidas, Samaras & Avouris, 2017; De Silva et al., 2021, 2019; Sugano et al., 2016; Zhang, Sugano & Bulling, 2019a; Wang, Pi, Qin, Shen & Shi, 2018a; Wang, Dong, Chen & Shi, 2015). Through eye gaze estimation, valuable information of human behavior such as the object of concentration, internal cognitive state, user intent, and attention analysis can be inferred (Kar & Corcoran, 2017). Eye tracking and gaze estimation were limited to psychological and cognitive studies and medical research in the early stages. But with technological breakthroughs in computing power, digital video processing, and low-cost hardware, applications in gaze estimation have grown into new domains such as gaming, virtual reality, and web advertising (Kar & Corcoran, 2017; Morimoto & Mimica, 2005). In human-computer interaction, gaze location can be used as an input modality to supplement other primary modalities such as a mouse, keyboard, and touch. Eye movements reflect the cognition process of a human, as well as the medical and mental condition of that person, which can be used in multiple applications (Guojun & Saniie, 2016). Kar & Corcoran (2017) have classified the types of devices in which single-user gaze estimation is used into five broad categories as, desktop-based systems, television and large display panels, Head-mounted setups, Automotive, and Hand-held devices (smartphones and tablets). In desktop-based systems, gaze estimation is used for computer communication such as mouse pointer control, gaze-based object selection, password entry, and psychoanalysis (Sibert & Jacob, 2000; Zhai, Morimoto & Ihde, 1999; Ghani, Chaudhry, Sohail & Geelani, 2013; Kasprowski & Harężlak, 2014; Kumar, Garfinkel, Boneh & Winograd, 2007a). In television and large display, the panels gaze estimation can be applied for navigating menus, modifying display properties in TVs, switching channels, and understanding user interests (Gwon, Cho, Lee, Lee & Park, 2013; Lee, Luong, Cho, Lee & Park, 2010). Gaze trackers installed on the head are commonly employed in portable platforms and have a variety of uses in domains such as augmented reality, virtual reality, sports training, computer gaming, and psychological research (Lee, Lee & Choi, 2011; Lee, Ko & Park, 2009; Piumsomboon, Lee, Lindeman & Billinghurst, 2017; Thies, Zollhöfer, Stamminger, Theobalt & Nießner, 2018; Sidorakis, Koulieris & Mania, 2015). In automotive systems, gaze estimation is vital for driver alertness detection, First Author et al.: Preprint submitted to Elsevier Page 7 of 29 Short Title of the Article driver fatigue detection, and cognitive state estimation (Ji, Zhu & Lan, 2004; Sun, Xu & Yang, 2007; Zheng, Nakano, Ishiko, Hagita, Kihira & Yokozeki, 2015). In the context of hand-held devices, smartphone and tablet interaction has been immensely improved with the assistance of gaze estimation for tasks such as controlling the device, gaze-based user authentication, and keyboard typing (Liu, Dong, Gao & Wang, 2015; Velichkovsky et al., 2014). 
Consequently, while single-user gaze estimation has expanded to a broad range of domains and applications, multi- user gaze estimation still is a novel concept at the research level. 4. Gaze estimation approaches Existing gaze estimation approaches are classified into two broad categories: appearance-based techniques and model-based techniques. Model-based gaze estimation techniques make use of a geometric model of the eye that includes a number of ocular components such as the cornea, optical, and visual axes. While model-based gaze estimation methods are more precise, they typically require time-consuming personal calibration for each participant. Appearance-based methods usually require user eye appearance images to directly learn a mapping function from eye appearance image to gaze estimation (Kellnhofer et al., 2019; Xu, Ehinger, Zhang, Finkelstein, Kulkarni & Xiao, 2015; Huang, Veeraraghavan & Sabharwal, 2017; Fischer, Chang & Demiris, 2018). Appearance-based methods typically do not require camera calibration and geometry data since the mapping is made directly on the image of the user’s eye. Appearance-based methods can be divided into two categories as conventional appearance-based methods and appearance-based methods with deep learning, and their abstract concepts are depicted in Figure 4 and Figure 5, respectively. Figure 4: Conventional appearance based methods. Figure 5: Appearance based methods with deep learning. 4.1. Conventional Appearance Based Methods Conventional appearance-based approaches treat whole images as features and deduce eye gaze directly from them. Conventional appearance-based methods have used mapping functions such as Adaptive linear regression, K-Nearest- Neighbor, Random forest regression, Artificial Neural Networks, Gaussian Processers, Support Vector Machines. Lu, Sugano, Okabe & Sato (2014b) have proposed adaptive linear regression (ALR) technique for mapping high- dimensional features of the ocular image to low-dimensional gaze positions, which significantly reduces the number of training samples for high accuracy estimation. k-Nearest Neighbors has become a standard method in the conventional First Author et al.: Preprint submitted to Elsevier Page 8 of 29 Short Title of the Article appearance-based method for predicting gaze using the mean of neighbor samples’ gaze angles. Wang, Zhao, Ding, Peng, Bian & Fu (2018b) have presented a gaze estimation framework that is a combination of neighbor selection and neighbor regression. It makes extensive use of information about the head’s position, the pupil center, and the appearance of the eyes. Kacete, Séguier, Collobert & Royan (2016) have proposed an approach based on an ensemble of trees grouped in a single forest to learn the highly non-linear mapping function between the gaze information and the RGB eye image appearances, including depth cues. Yu, Xu & Huang (2016) have proposed a method based on particle swarm optimization BP neural network. These methods suffer from many challenges. Most Conventional appearance-basedmethods require a fixed head pose or a limited range of headmovements as represented in Figure 6(a). Furthermore, this method has difficulties handling subject differences, especially in the unconstrained environment. Figure 6: Constrained environment and Unconstrained environment 4.2. Appearance Based Methods with Deep Learning In computer vision, it has been demonstrated that deep learning techniques outperform earlier state-of-the-art machine learning techniques. 
Recently, research on gaze estimation has concentrated on methods based on deep learning. They have the ability to overcome the challenges such as significant head motion, subject differences, and unconstrained environmental settings as represented in Figure 6(b). CNNs are the most widely used algorithm in this regard. An in depth discussion on appearance based methods with deep learning is presented in the Section 8. 5. Coordinate systems This section addresses the main types of coordinate systems that have been addressed in the literature on gaze estimation. Mainly the coordinate systems can be categorized as, 1) Image coordinates, 2) Subject and camera coordinates, and 3) Screen coordinates, as shown in Figure 7. Figure 7: Coordinate systems used for 2D and 3D gaze estimation; (a) Image coordinates, (b) Subject and camera coordinates, (c) Screen coordinates. Image Coordinate System: Image coordinate system is a 2D coordinate systemwhich enables to specify a location in a 2D image (Recasens, 2016; Chong et al., 2020; Fang, Tang, Shen, Shen, Gu, Song & Zhai, 2021). There are two types of image coordinates, pixel coordinate and spatial coordinate. The image is treated as a grid composed of discrete elements in the pixel coordinates, ordered from top to bottom and left to right. Spatial coordinates provide for more First Author et al.: Preprint submitted to Elsevier Page 9 of 29 Short Title of the Article precise location specification in an image than pixel coordinates do, and they describe image positions in terms of partial pixels. Image coordinate systems are especially used in gaze-following systems (Recasens, 2016). In gaze-following, a single image contains one or more people, and each person’s center of the eyes location, head location, location of gaze point, and pixel or spatial coordinate system use for annotating these locations in the image. In addition, some datasets contain object’s bounding boxes, segmentation masks, and other boundaries, and these are also annotated using the image coordination system (Tomas, Reyes, Dionido, Ty, Mirando, Casimiro, Atienza & Guinto, 2021). Figure 7(a) depicts the standard image coordinate system. Subject and Camera Coordinate System: Subject coordinates represent the coordinate of the world from the perspective of the user’s eyes (Kellnhofer et al., 2019; Bermejo et al., 2020). Camera coordinates share an origin with the subject coordinate system, but the coordinate axis orientation may be different as shown in the Figure 7(b). The coordinates of a camera are expressed in terms of points with the origin at the optical center of the camera. Subject coordinates and Camera coordinate systems are 3D coordinate systems and specially used in gaze direction estimation systems to target positioning, express the gaze orientation. Screen Coordinate System:When using an eye tracker with a screen(gaze point estimation), all gaze estimations are mapped into a screen coordinate system (Sugano et al., 2016; Kar & Corcoran, 2017). This two-dimensional coordinate system corresponds to the physical coordinates of pixels on the computer screen based on the current screen resolution. The origin of the screen coordinate system is the screen’s top left corner, and the point (0, 0) signifies the screen’s upper left corner, while (1, 1) denotes the screen’s bottom right corner as shown in the Figure 7(c). 6. Data sets 6.1. 
Requirements for real-world benchmark datasets The general requirements for a real-world benchmark dataset for multi-user gaze estimation can be listed as follows. Environment Different light conditions (Kellnhofer et al., 2019) that exist in unconstrained environments should be captured to improve the generality of the dataset. Bright light, night light, dawn, dusk and shadows are few of the varying illumination conditions under which the images should be captured. The dataset should include a broad diversity in scenarios such as different head poses, body poses, in-frame gaze points, out-frame gaze point (Kellnhofer et al., 2019; Chong et al., 2020). Furthermore, these scenarios should be captured with different backgrounds patterns and textures. Target variation In the gaze estimation literature, a variety of targets such as gaze point, gaze direction, and gazed object has been studied. 2D and 3D gaze points are required by multiple applications such as desktop scenarios and public displays (Recasens, 2016; Sugano et al., 2016). 2D and 3D gaze directions are required to calculate the respective gaze points (Fang et al., 2021). Gazed object is a target associated with the novel concept of gaze object prediction, which requires annotating gazed object bounding boxes (Tomas et al., 2021). A multi-user perspective of these targets is a necessary requirement in a multi-user benchmark dataset. Subject variation Substantial subject variations should be captured by considering the aspects such as collecting images with a sufficient number of subjects, male and female subjects, subjects from different regions of the world representing different skin colors, face and eye shapes, etc (Kellnhofer et al., 2019; Tomas et al., 2021). Viewpoint Different viewpoints have been studied in the gaze literature. 2D image coordinates, 2D screen coordinates, 3D subject coordinates, and 3D camera coordinates are the used viewpoints of coordinate systems. Head pose is captured in different viewpoints such as constraint head poses, unconstrained head poses consisting broad head yaw and pitch variations (Zhang, Park, Beeler, Bradley, Tang & Hilliges, 2020). Similarly, ocular regions are captured in multiple viewpoints such as without occlusion, partial occlusion, and total occlusion (Kellnhofer et al., 2019). Either head-mounted displays or remote cameras such as webcams, kinect, and surveillance cameras are used to collect images from the different viewpoints. Challenging conditions Multi-user gaze estimation in unconstrained settings introduce numerous challenging conditions. Datasets should capture these challenging conditions such as eye, face and body occlusion (Kellnhofer et al., 2019), subject distortions such as scenarios where subjects are wearing spectacles (Tomas et al., 2021). Furthermore, datasets should capture scene images with varying camera-to-subject distances (Mishra & Lin, 2020) and different illumination conditions (Zhang et al., 2015). First Author et al.: Preprint submitted to Elsevier Page 10 of 29 Short Title of the Article 6.2. Public gaze datasets Recent research on eye gaze estimation have used different types datasets with the growth of deep learning techniques. Most of the publicly available datasets have used head mounted devices, surveillance camera and other desktop and mobile eye trackers to capture images for eye tracking, head pose detection and pupil tracking. A summary of common gaze estimation datasets is given in Table 2. 
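The datasets summarized in Table 2 annotate gaze either as 2D points or directions in the image or screen plane, or as 3D gaze directions. When working with 3D annotations, it is common to convert between pitch/yaw angles and unit gaze vectors; the sketch below illustrates one such conversion. The axis convention (x right, y down, z forward from the camera) is an assumption made only for illustration, since conventions differ between datasets.

```python
# Hypothetical helpers (not from any dataset toolkit) for converting a 3D
# gaze annotation between (pitch, yaw) angles and a unit gaze vector.
# Assumed camera-coordinate convention: x right, y down, z forward.
import numpy as np

def pitchyaw_to_vector(pitch, yaw):
    """(pitch, yaw) in radians -> unit gaze vector in camera coordinates."""
    x = -np.cos(pitch) * np.sin(yaw)
    y = -np.sin(pitch)
    z = -np.cos(pitch) * np.cos(yaw)
    return np.array([x, y, z])

def vector_to_pitchyaw(v):
    """Unit gaze vector -> (pitch, yaw) in radians."""
    v = v / np.linalg.norm(v)
    return np.arcsin(-v[1]), np.arctan2(-v[0], -v[2])

g = pitchyaw_to_vector(np.radians(10.0), np.radians(-20.0))
print(vector_to_pitchyaw(g))  # recovers approximately (0.175, -0.349) rad
```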
Most of the existing datasets support the single-user eye gaze estimation process, and there is a lack of datasets with multi-user eye gaze images. Some of the datasets are captured in controlled environments, whereas others are acquired in uncontrolled (wild) settings. Moreover, Figure 8 includes sample images from the publicly available datasets: (a) MPIIGaze, (b) Columbia Gaze, (c) Gaze360, (d) GazeFollow, and (e) Gaze on Objects.

Figure 8: Sample images from publicly available datasets.

Some of the gaze estimation datasets that are widely used in related studies can be listed as follows.

MPIIGaze Zhang et al. (2015) have presented the MPIIGaze dataset. It is a novel in-the-wild gaze dataset and one of the most widely used datasets for estimating gaze using appearance-based methods. This dataset was collected using laptops over a three-month period and demonstrates significant variations in eye appearance. Even though the original dataset only contains binocular eye images, the improved version of the dataset includes face images (Zhang, Sugano, Fritz & Bulling, 2017) and manually annotated landmarks (Zhang, Sugano, Fritz & Bulling, 2019b) as well. It contains 213,659 images gathered from fifteen participants, and it includes both 2D and 3D annotations. Additionally, MPIIGaze provides a standard evaluation dataset that includes 15 participants and 3,000 images of each participant's left and right eyes. Most current gaze datasets restrict the head pose range. However, MPIIGaze includes an extensive head-pose range and gaze angle range (Sugano, Matsushita & Sato, 2014; Mora, Monay & Odobez, 2014).

Table 2: Summary of Gaze Estimation Datasets
Dataset | Year | Subjects | Total | Annotations | Type | Environment
MPIIGaze Zhang et al. (2015) | 2015 | 15 | 213,659 | 2D and 3D gaze directions | Single | Wild
Columbia Gaze Smith, Yin, Feiner & Nayar (2013) | 2013 | 56 | 5,880 | 3D gaze direction | Single | Controlled
Gaze360 Kellnhofer et al. (2019) | 2019 | 238 | 172,000 | 3D gaze direction | Single | Wild
GazeFollow Recasens (2016) | 2015 | 130,339 | 122,143 | 2D gaze direction, target | Multi | Wild
GOOReal Tomas et al. (2021) | 2021 | 100 | 9,552 | 2D gaze direction, target | Single | Wild
UTMultiview Sugano et al. (2014) | 2014 | 50 | 1,100,000 | 2D and 3D gaze direction | Single | Controlled
EyeDiap Mora et al. (2014) | 2014 | 16 | 94 (videos) | 2D and 3D gaze direction | Single | Controlled
GazeCapture Lu, Okabe, Sugano & Sato (2014a) | 2016 | 1474 | 2,400,000 | 2D gaze direction | Single | Wild
RT-Gene Fischer et al. (2018) | 2018 | 15 | 123,000 | 2D gaze direction | Single | Controlled
ETH-XGaze Zhang et al. (2020) | 2020 | 110 | 1,100,000 | 2D and 3D gaze direction | Single | Controlled
NVGaze Kim, Stengel, Majercik, De Mello, Dunn, Laine, McGuire & Luebke (2019) | 2020 | 30 | 4,500,000 | 2D gaze direction | Single | Controlled
TabletGaze Huang et al. (2017) | 2017 | 51 | 816 (videos) | 2D gaze direction | Single | Controlled

Columbia Gaze Smith et al. (2013) have developed a large publicly available dataset for appearance-based gaze estimation. The collection contains 5,880 high-quality images of 56 subjects (32 males and 24 females), with a resolution of 5184x3456 pixels for each image. Participants ranged in age from 18 to 36 years, and 21 of them wore glasses.
Twenty-one participants were Asian, nineteen were Caucasian, eight were South Asian, seven were black, and four were Hispanic or Latina, indicating a wide range of eye appearances. For each subject, images were collected for each of the seven horizontal gaze directions, five horizontal head poses, and three vertical gaze directions. In the data collection setting, participants were seated in a fixed place in front of a black background. They were asked to focus on a dot shown on a wall while their eye gaze was recorded. The 3×7 grid of dots was placed in ten increments vertically and ten increments horizontally.

Gaze360 Most of the available datasets are not suited for developing a model capable of reliably assessing 3D gaze in the wild. Kellnhofer et al. (2019) have proposed Gaze360, a large-scale gaze estimation dataset for unconstrained 3D gaze estimation. Gaze360 is unique for its combination of numerous gaze and head poses, 3D gaze annotations, a variety of indoor and outdoor locations, and a diversity of subjects in terms of age, sex, and ethnicity. The dataset contains 172,000 images of 238 participants, and each image has a resolution of 3382×4096 pixels. The dataset was collected in 5 indoor locations (53 participants) and 2 outdoor locations (185 participants); 58% of the participants were female and 42% were male. The dataset enables gaze estimation up to the limit of eye visibility, which in certain circumstances corresponds to gaze yaws of around ±140°. The Gaze360 data collection arrangement was centered on a Ladybug5 360° panoramic camera placed in the scene's center and a moving target board marked with an AprilTag (Wang & Olson, 2016) and a cross on which participants were asked to gaze constantly. Participants were asked to position themselves at a distance of approximately 1-3 m from the camera.

GazeFollow Recasens (2016) have built GazeFollow, a large-scale dataset labelled with the 2D image location of where people in the images are looking. The dataset contains 122,143 images drawn from several popular datasets (Xiao, Hays, Ehinger, Oliva & Torralba, 2010; Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár & Zitnick, 2014; Yao, Jiang, Khosla, Lin, Guibas & Fei-Fei, 2011; Everingham, Gool, Williams, Winn & Zisserman, 2009; Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg & Fei-Fei, 2015; Zhou, Lapedriza, Xiao, Torralba & Oliva, 2014) which use people as a source of imagery. These images contain individuals engaged in a variety of ordinary tasks, and each image contains a single person or multiple people. Since the images do not include ground-truth gaze, the images were labelled using Amazon's Mechanical Turk and the authors' online tool. The GazeFollow dataset is designed to capture different fixation scenarios. Several images depict multiple people attending to a common target, while others depict individuals looking at each other.

GOOReal Most existing gaze estimation datasets only have the pixel being looked at and not the boundaries of a particular object of interest. This lack of object annotations presents an opportunity for advanced gaze estimation research. Tomas et al. (2021) have introduced the task of gaze object prediction along with the Gaze On Object (GOO) dataset for the retail environment to address this issue. The GOOReal dataset consists of 9,552 images of 100 participants (32 female and 68 male), and each image is composed of shelves packed with 24 different classes of product items.
Each participant was instructed to enter the grocery environment and then fixate on each item for a few seconds. Two images were collected for each item stared at, and annotators attached a ground-truth label (grocery item id) to each image. All objects are annotated with their class, bounding box (product items, head area), and segmentation mask.

6.3. Issues related to public gaze datasets
Many public datasets have several issues and challenges when used in real-world applications. The majority of datasets are suited for physically constrained applications such as desktop and mobile phone gaze estimation. Typically, these datasets are collected using a static recording setup, which allows higher accuracy but may lack diversity in illumination and motion blur. Therefore, these datasets are not valid for general applications. On the other hand, these datasets contain relatively small head pose angles and gaze variation and are restricted to frontal views. Most of the existing gaze datasets are not annotated for multi-user gaze estimation. Therefore, additional effort is required to annotate the images when using these datasets in the multi-user gaze estimation process.

6.4. Generated synthetic datasets
Generally, publicly available datasets are primarily used to train and evaluate gaze estimation models. Collecting accurate gaze estimation data and creating a dedicated gaze estimation dataset requires time, effort, and cost. Additionally, public datasets are not always suitable and sufficient for a particular task. Tomas et al. (2021) have presented a synthetic dataset called GOO-Synth, and it contains 192,000 images. To build GOO-Synth, the Unreal Engine has been used to create a realistic-looking replica of the scene used in the real dataset. Bermejo et al. (2020) have created a synthetic dataset with 50 subjects to improve the back head detection task in their models. Different approaches based on techniques such as Mask-RCNN (Shashirangana, Padmasiri, Meedeniya, Perera, Nayak, Nayak, Vimal & Kadry, 2021) and StyleGAN (Karras, Laine, Aittala, Hellsten, Lehtinen & Aila, 2020) have been used in the literature to generate synthetic datasets.

7. Performance evaluation metrics and standards
Performance evaluation metrics and standards used to evaluate the performance of 2D and 3D gaze estimation techniques are described in this section. In the literature, the type of evaluation metric has depended on the nature of gaze estimation, which can be further classified into two broad categories, namely 2D gaze estimation and 3D gaze estimation. Furthermore, these metrics differ depending on the gaze estimation task performed, which can be gaze point estimation, gaze direction estimation, or gaze object prediction.

Area Under Curve (AUC) The Area Under the ROC Curve is one of the primary metrics used to evaluate the accuracy of 2D gaze point estimation (Recasens, 2016; Fang et al., 2021; Tomas et al., 2021; Chong et al., 2020). Judd, Ehinger, Durand & Torralba (2009) have presented the Area Under Curve criterion derived from a ROC curve to evaluate how well saliency maps predict human gaze fixations. The saliency map is treated as a binary classifier for each image pixel in this metric. The classification threshold is determined in such a way that a specified percentage of image pixels are categorized as fixated, while the remainder are classified as not fixated.
The AUC will be one if the model behaves perfectly, while random performance is 0.5.

L2 Distance L2 distance is another primary metric used to evaluate the accuracy of 2D gaze point estimation (Recasens, 2016; Fang et al., 2021; Tomas et al., 2021; Chong et al., 2020). The mean Euclidean distance between the gaze predictions and their respective ground-truth gaze annotations is defined as the L2 distance in the 2D gaze estimation literature (Fang et al., 2021; Recasens, 2016). The L2 distance can be obtained from Equation 1, where $(gt\_x_i, gt\_y_i)$ refers to the ground-truth gaze annotations and $(x_i, y_i)$ refers to the gaze predictions in 2D image coordinates.

$$L2\ distance = \frac{1}{n}\sum_{i=1}^{n}\sqrt{(gt\_x_i - x_i)^2 + (gt\_y_i - y_i)^2} \qquad (1)$$

Angular Error Some studies have used angular error to determine the accuracy of 2D and 3D gaze direction estimation techniques (Kellnhofer et al., 2019; Recasens, 2016; Tomas et al., 2021; Fang et al., 2021). The angular difference between the predicted and true gaze direction vectors is defined as the angular error. The predicted gaze direction vector is produced by connecting the head point to the predicted gaze point. This metric is calculated in both 2D and 3D vector spaces.

Average Precision The average precision metric is used in scenarios where out-of-frame gaze binary classification has been considered (Fang et al., 2021; Chong et al., 2020). The area under the precision-recall curve is defined as the average precision, as stated in Equation 2.

$$AP = \int_{0}^{1} p(r)\, dr \qquad (2)$$

Classification Accuracy The classification accuracy metric is reported in problems where gaze estimation has been represented as a classification problem (Akinyelu & Blignaut, 2020; Mahanama, Jayawardana & Jayarathna, 2020). It is the ratio of correct predictions to total predictions. The accuracy of binary classification is expressed in terms of positives and negatives as given in Equation 3.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$

Mean Squared Error (MSE) Mean Squared Error is another metric used to determine the accuracy of 2D and 3D gaze direction estimation techniques (Kellnhofer et al., 2019; Recasens, 2016; Tomas et al., 2021; Fang et al., 2021). Mean Squared Error is defined as the average squared difference between the ground truth and the prediction (Handelman, Kok, Chandra, Razavi, Huang, Brooks, Lee & Asadi, 2019). In the gaze estimation literature, MSE can be obtained from Equation 4, where $y_i$ and $gt\_y_i$ refer to the predicted gaze and ground-truth gaze, respectively.

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - gt\_y_i)^2 \qquad (4)$$

8. Related Research Models
Deep learning-based techniques have been widely used in the field of gaze estimation due to their ability to map high-level gaze features directly from images and produce results in real-world settings (Kellnhofer et al., 2019; Wang & Shen, 2017; Zhang et al., 2015). CNNs are the backbone of most of these techniques, incorporating other deep learning architectures and techniques such as Capsule Networks, Recurrent Neural Networks, Residual Neural Networks, Multi-Task CNNs and Transfer Learning (Kellnhofer et al., 2019; Mahanama et al., 2020; Fang et al., 2021; Chong, Ruiz, Wang, Zhang, Rozga & Rehg, 2018; Lian, Yu & Gao, 2018). This section explores deep learning-based gaze estimation methods with a focus on multi-user gaze estimation. These methods are introduced from two main perspectives: deep learning-based methods for single-user gaze estimation and deep learning-based methods for multi-user gaze estimation.
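As a practical reference for the performance figures quoted for the surveyed studies (Tables 3 and 4), the following sketch shows how the two most commonly reported metrics, mean angular error and mean L2 distance (Equation 1), can be computed. The arrays are hypothetical predictions and ground truth, not data from any surveyed work.

```python
# Illustrative computation of mean angular error (degrees) between 3D gaze
# vectors and mean L2 distance between 2D gaze points; inputs are made up.
import numpy as np

def mean_angular_error(pred_vecs, gt_vecs):
    pred = pred_vecs / np.linalg.norm(pred_vecs, axis=1, keepdims=True)
    gt = gt_vecs / np.linalg.norm(gt_vecs, axis=1, keepdims=True)
    cos_sim = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos_sim)).mean()

def mean_l2_distance(pred_points, gt_points):
    return np.linalg.norm(pred_points - gt_points, axis=1).mean()

pred_v = np.array([[0.10, -0.20, -0.97], [0.00, 0.00, -1.00]])
gt_v = np.array([[0.12, -0.18, -0.98], [0.05, 0.02, -0.99]])
print(mean_angular_error(pred_v, gt_v))                      # in degrees
print(mean_l2_distance(np.array([[0.40, 0.50]]),
                       np.array([[0.45, 0.52]])))            # in image units
```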
The surveyed studies are further categorized according to the coordinate system and environmental settings. The studies on single-user and multi-user eye gaze estimation approaches are summarized in Table 3 and Table 4. Moreover, we discuss the recent research studies that have used 2D and 3D deep learning architectures as demonstrated in Figure 9 to Figure 14.

8.1. Deep Learning Based Methods for Single-User Gaze Estimation
8.1.1. 2D Deep Learning Methods in Constrained-Environments: Single-user
The extraction of the ocular regions is a challenging task in naturalistic environments due to occlusion (Saad, Elkafrawy, Abdennadher & Schneegass, 2020). Also, extracting head-pose information from ocular regions has not been explored in detail. Among the related studies, Mahanama et al. (2020) have proposed an appearance-based 2D gaze estimation model, Gaze-Net, using capsule networks for decoding, representing, and estimating gaze information from ocular region images. Capsule networks have been used in contrast to CNNs with pooling due to their capability to learn equivariant representations of objects. A two-step approach combining classification of gaze direction into six classes and reconstruction of the original ocular image has been followed to construct and train the deep neural network. In their work, it has been hypothesized that a single eye image contains sufficient information to reliably estimate the gaze. Two publicly available datasets, MPIIGaze (Zhang et al., 2019b) and Columbia Gaze (Smith et al., 2013), have been used to train and test the model. Further, to obtain the x, y coordinates of ocular regions in the images, PoseNet (Oved, Alvarado & Gallo, 2018) has been incorporated. An accuracy of 62% and a mean absolute error of 2.84 have been recorded for the gaze estimation task.

8.1.2. 2D Deep Learning Methods in the Wild: Single-user
Existing gaze-related dataset annotations only contain the gazed pixel and do not include the area of a specific object of interest. In a related study, Tomas et al. (2021) have addressed this issue by introducing a challenging task called gaze object prediction. Moreover, they have presented the Gaze on Objects dataset based on the retail environment for training and evaluation. The dataset consists of a smaller set of real images (GOO-Real) and a larger synthetic set of images (GOO-Synth). GOO-Real consists of 9,552 images of 100 human subjects. GOO-Synth consists of 192,000 images created with the Unreal Engine. All objects in the frame are annotated with their class, bounding box, and segmentation mask. The GOO dataset can be used for gaze following, gaze object prediction, and domain adaptation. Several baselines (Chong et al., 2020; Lian et al., 2018) are benchmarked on the GOO dataset. They have been evaluated using standard metrics such as the area under the ROC curve (AUC), the L2 distance, and the angular error. Baseline evaluation results consistently show that models trained on the GOO-Synth dataset before being trained on the GOO-Real dataset achieve higher performance on all metrics.

8.1.3. 3D Deep Learning Methods in Constrained-Environments: Single-user
Most appearance-based eye gaze estimation methods have only used encoded features from eye images. In addition, gaze estimation tasks are limited to 2D screen mapping. Zhang et al. (2017) have proposed a 2D and 3D appearance-based gaze estimation method that uses face images as the input.
The proposed model architecture is based on CNNs. They have introduced additional layers that learn spatial weights to activate the last convolutional layer, to efficiently use the face information. The spatial weights mechanism forces the network to understand and learn the importance of various face regions for gaze estimation. This mechanism has been implemented using the concept of the 1x1 convolutional layer and the rectified linear unit layer. Through the evaluation, their method outperforms the state of the art for both 2D and 3D gaze estimation, reaching mean errors of 6° and 4.8°, improvements of up to 27.7% and 14.3%, on EYEDIAP (Mora et al., 2014) and MPIIGaze (Zhang et al., 2019b) for 3D gaze estimation. Another approach for 3D single-user gaze estimation has been proposed by Lian, Zhang, Luo, Hu, Wu, Li, Yu & Gao (2019) using multi-task CNNs. In this work, the 3D gaze estimation task has been introduced as RGBD gaze estimation by incorporating the depth channel as well. A generative adversarial network (GAN) has been used for depth image generation to reduce noise and black holes. The proposed network architecture combines an eyeball feature extractor, a head pose extractor, and a 3D eye position encoder to predict the gaze point by taking two single eye images and an RGBD (Red, Green, Blue, Depth) head image as inputs.

8.1.4. 3D Deep Learning Methods in the Wild: Single-user
Many related studies have explored gaze target detection without incorporating depth estimation into gaze prediction (Chong et al., 2020; Recasens, 2016). As a solution, Fang et al. (2021) have proposed a method for gaze target detection in unconstrained environments based on deep CNNs. As shown in Figure 9, the authors have introduced a novel architecture for the task by incorporating 3D gaze estimation and a dual attention module (DAM) consisting of a field-of-view mask and a gaze-depth channel. The model takes a single in-the-wild image as the input and outputs a 2D saliency map.

Figure 9: Representation of the model architecture presented by Fang et al. (2021).

The depth map of the image has been generated by employing the monocular depth estimation model from another study by Ranftl, Lasinger, Hafner, Schindler & Koltun (2020). A coarse-to-fine strategy has been developed for 3D gaze estimation, which can cope with completely occluded eyes and faces. The task of gaze target prediction has been presented as a combination of two sub-tasks: 1) identifying whether the gaze target is inside or outside the image, and 2) locating the target if it is inside. The output from the DAM and the scene image have been passed to a ResNet-50 backbone and then to a binary classification head and a heatmap regression head to obtain the two results. They have used the Gaze360 (Kellnhofer et al., 2019) and GazeFollow (Recasens, 2016) datasets and the VideoAttentionTarget (Chong et al., 2020) dataset to train, test, and fine-tune the model, respectively. The proposed method has produced results on par with a single human, achieving 14.9° angular error, 0.922 AUC, and 0.896 average precision. This work has shown promising results for single-user gaze target detection using 2D images in the wild in spite of head-eye inconsistency and occlusion. Robustly estimating gaze in the wild with varying camera-to-person distances is another challenge for CNN backbones. Mishra & Lin (2020) have proposed a novel solution for the task by aggregating multiple zoom scales of the same input image using the center-cropping technique.
Moreover, they have introduced a sine-cosine transform to avoid the yaw angle discontinuity in 360° backward gaze estimation, which penalizes deep learning models with substantial losses. The aggregation of center-cropped input images with multiple sizes has been carried out by spatial max pooling and fed into a ResNet-18 (He, Zhang, Ren & Sun, 2016) backbone and other backbone variants to regress the sin(), cos(), and sin() values. The pinball loss function inspired by Gaze360 (Kellnhofer et al., 2019) has been used to further output the uncertainty of the predictions. A sequential model using bidirectional Long Short-Term Memory (LSTM) over a sequence of multi-crops has also been proposed and has achieved better performance on the Gaze360 dataset. The best mean angular errors achieved for the full 360°, front 180°, and backward settings of the Gaze360 dataset are 12.4°, 10.7°, and 18.9°, respectively, using the sequential model with Hard-net (Chao, Kao, Ruan, Huang & Lin, 2019) as the backbone. Validation of the model on the RT-GENE dataset has achieved a state-of-the-art mean angular error of 6.7° using the static model. Other work on 3D single-user gaze estimation in the wild can be summarized as follows. Chong et al. (2018, 2020) have published two consecutive studies using state-of-the-art deep CNNs to predict heatmaps of targets gazed at by a single user. The predecessor has achieved near single-human performance on the GazeFollow dataset for single-image gaze target prediction. A summary of single-user gaze estimation approaches is given in Table 3.

Table 3: Summary of Single-User Gaze Estimation Approaches
Ref. | Architecture | Backbone | Dataset | Performance | Coordinate System | Environment
Zhang et al. (2017) | CNN-Spatial | AlexNet | Own dataset | Ang. - 4.8° | 2D, 3D | Controlled
Chong et al. (2018) | Multi-task CNN | ResNet-50 | EYEDIAP, GazeFollow, SynHead | AUC - 0.896, L2 - 0.187, Ang. - 6.4° | 3D | Wild
Lian et al. (2019) | Multi-task CNN | ResNet-34, Own | EYEDIAP, Own | AUC - 0.906, L2 - 0.145, MAng - 8.8° | 3D | Controlled
Chong et al. (2020) | CNN-LSTM | ResNet-50 | GazeFollow, VideoAttentionTarget, VideoCoAtt | AUC - 0.924, L2 - 0.096, Out of Frame AP - 0.925 | 3D | Wild
Mahanama et al. (2020) | Capsules, CNN | Own architecture | MPIIGaze, Columbia Gaze | Accuracy - 62% | 2D | Controlled
Mishra & Lin (2020) | CNN-LSTM | ResNet-18, Hardnet | Gaze360, RT-GENE | MAng - 12.4° | 3D | Wild
Tomas et al. (2021) | CNN-static | ResNet-50 | GOO, GazeFollow | AUC - 0.889, L2 - 0.150, Ang. - 29.1° | 2D | Wild
Fang et al. (2021) | CNN-static | ResNet variants | Gaze360, GazeFollow, VideoAttentionTarget | AUC - 0.922, L2 - 0.124, Ang. - 14.9° | 3D | Wild

8.2. Deep Learning Based Methods for Multi-User Gaze Estimation
Gaze estimation of multiple people is a relatively new research area that has been emerging with the adoption of deep learning-based methods for gaze estimation (Sugano et al., 2016). A summary of multi-user gaze estimation approaches is presented in Table 4.

Table 4: Summary of Multi-User Gaze Estimation Approaches
Ref. | Architecture | Backbone | Dataset | Performance | Coordinate System | Environment
Recasens (2016) | CNN with shifted grids | AlexNet | GazeFollow | AUC - 0.878, L2 - 0.190, Ang. - 24° | 2D | Wild
Sugano et al. (2016) | CNN spatio-temporal | AlexNet | Own, Coutrot, Hollywood2 | - | 3D | Wild
Kodama et al. (2018) | CNN | LeNet-5 | Own | MAE - 10.39 m | 3D | Wild
Kellnhofer et al. (2019) | CNN-LSTM | ResNet-50 | Gaze360 | MAng - 13.5° | 3D | Wild
8.2. Deep Learning Based Methods for Multi-User Gaze Estimation
Gaze estimation of multiple people is a relatively new research area that has been emerging with the adoption of deep learning-based methods for gaze estimation (Sugano et al., 2016). A summary of multi-user gaze estimation approaches is presented in Table 4.

Table 4: Summary of Multi-User Gaze Estimation Approaches
Ref. | Architecture | Backbone | Dataset | Performance | Coordinate System | Environment
Recasens (2016) | CNN with shifted grids | AlexNet | GazeFollow | AUC - 0.878, L2 - 0.190, Ang. - 24° | 2D | Wild
Sugano et al. (2016) | CNN spatio-temporal | AlexNet | Own, Coutrot, Hollywood2 | - | 3D | Wild
Kodama et al. (2018) | CNN | LeNet-5 | Own | MAE - 10.39 m | 3D | Wild
Kellnhofer et al. (2019) | CNN-LSTM | ResNet-50 | Gaze360 | MAng - 13.5° | 3D | Wild
Lian et al. (2018) | CNN | ResNet-50 | GazeFollow, DLGaze | AUC - 0.906, L2 - 0.081, MAng. - 8.8° | 2D | Wild
Bermejo et al. (2020) | CNN | ResNet-18 | UcoHead, Own | MAE - 19°, FPS - 0.52 | 3D | Wild

Existing multi-user gaze estimation methods can be divided into two categories: (1) techniques that analyze the gazes of multiple people sharing both time and space, and (2) techniques that explore the gazes of multiple people sharing only space (Kodama et al., 2018). The first type requires each person to wear a head-mounted camera to estimate their gaze, which hinders its practicality in real-world scenarios (Kodama et al., 2018; Park et al., 2012; Park & Shi, 2015). The approaches of the second type are discussed in this section, comparing their performance, reliability, and challenges. These approaches are presented under two subsections based on the dimensionality of gaze estimation and the nature of the environmental constraints.

8.2.1. 2D Deep Learning-based Methods in the Wild: Multi-user
Multi-user gaze estimation in a 2D image coordinate system is a timely approach due to the potential of deep learning techniques in determining gaze direction in unconstrained settings. Recasens (2016) has proposed a deep neural network-based approach using CNNs for the novel task of gaze-following in the wild, together with the benchmark dataset GazeFollow. Gaze-following is the task of following a person's gaze to predict the object being looked at, which had not received prominent attention until this point. As shown in Figure 10, the head pose and the gaze orientation are extracted from the scene image.

Figure 10: Representation of the GazeFollow network architecture presented by Recasens (2016).

The locations of the different objects being looked at by different people in the scene are predicted in 2D image coordinates. Unlike previous work, this approach uses only a single third-person view of the scene, including the person and the object being gazed at, to infer gaze. They have introduced the large-scale GazeFollow dataset, annotated with gaze object annotations, by accumulating 122,143 images containing 130,339 people from several significant datasets for model training and evaluation. An in-depth survey of the dataset is given in Section 6.2. The dataset is designed to capture various fixation scenarios in which the number of people varies from a single person to a crowd. They have described human gaze-following using a gaze pathway that detects the gaze direction and a saliency pathway that identifies the salient objects. A CNN architecture based on AlexNet (Krizhevsky, Sutskever & Hinton, 2017) is used as the backbone. The model is designed to support multi-modal predictions for reliable prediction of gaze objects in ambiguous scenarios. The problem is formulated as a classification task by quantizing the fixation location into an NxN grid, where the size of N is selected using a shifted grids approach. The experimental results of the study show that the model achieves an AUC of 0.878 and an L2 distance of 0.190 for the gaze fixation prediction task, whereas the measured single-human performance for the task is 0.924 AUC and 0.096 L2 distance.
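The shifted-grids formulation can be illustrated with a short sketch that converts a continuous fixation location into one classification target per shifted grid; the grid size and shift values here are illustrative assumptions, not the exact settings of Recasens (2016).

```python
import numpy as np

def shifted_grid_labels(x, y, n=5,
                        shifts=((0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
                                (-0.5, 0.0), (0.0, -0.5))):
    """Quantize a normalized fixation (x, y) in [0, 1]^2 into one class label per
    shifted n x n grid. Each shift is a fraction of a cell; the result is a list of
    flattened cell indices, i.e. one classification target per grid."""
    cell = 1.0 / n
    labels = []
    for dx, dy in shifts:
        col = int(np.clip((x - dx * cell) / cell, 0, n - 1))
        row = int(np.clip((y - dy * cell) / cell, 0, n - 1))
        labels.append(row * n + col)
    return labels

print(shifted_grid_labels(0.62, 0.31))  # one cell index per shifted grid
```

At inference, the predictions of the overlapping grids would be combined in image space to recover a finer localization than any single coarse grid provides.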
Even though the results show that the model is robust to inaccurate head detection, the lack of 3D understanding has led to incorrect predictions in their work.

A similar approach to GazeFollow (Recasens, 2016) has been proposed by Lian et al. (2018) for multi-user gaze point prediction of a target person in a scene. As demonstrated in Figure 11, they have proposed a two-stage solution consisting of a gaze direction pathway and a heatmap pathway that mimics the gaze-following behavior of a human. In the first stage, the gaze direction is estimated from the head image and its position to generate multi-scale gaze direction fields. In the second stage, the multi-scale gaze direction fields are concatenated with the original image to regress the heatmap. Unlike in GazeFollow (Recasens, 2016), the two pathways are coupled with each other to mimic the gaze-following behaviour of a human. Furthermore, the more robust gaze heatmap prediction has been proposed as a replacement for direct gaze point estimation. A ResNet-50-based DCNN, together with a network of three fully connected layers, has been used for gaze direction prediction, and the Adam optimizer has been used for model training. The heatmap pathway has used a feature pyramid network (Lin, Dollár, Girshick, He, Hariharan & Belongie, 2017) with a sigmoid activation function. The GazeFollow dataset and their own video dataset, DLGaze, have been used for model training, validation, and evaluation. The experimental study has shown a mean angular error of 8.8°, surpassing the 11.6° result of Recasens (2016). The authors have stated that the two-stage architecture inspired by human behavior is the reason for the improved performance.

Figure 11: Representation of the model architecture presented by Lian et al. (2018).

8.2.2. 3D Deep Learning-based Methods in the Wild: Multi-user
Application-independent 3D gaze estimation in the wild serves as a good entry point for many applications in the domain. Kellnhofer et al. (2019) have proposed a robust appearance-based method for 3D gaze estimation in highly diverse unconstrained images using bidirectional Long Short-Term Memory (LSTM) capsules (Graves, Fernández & Schmidhuber, 2005). As shown in Figure 12, the authors have presented Gaze360, a large gaze estimation dataset containing 172K images of 238 subjects, with a wide range of gaze and head pose angles, significant variation in natural illumination, and diverse, arbitrary environments. The approach bridges the gap between leveraging the full potential of DCNNs and the lack of sufficient annotated, diverse data for the task. The proposed model emphasizes the temporal nature and the continuity of gaze as a signal by aggregating seven image frames to predict the gaze of the central frame using LSTM capsules. An ImageNet-pre-trained ResNet-18 (He et al., 2016) architecture is used as the CNN backbone to predict the gaze in real-world 3D spherical coordinates, and an uncertainty value for each gaze prediction is introduced and estimated using quantile regression (Koenker, 2005) via the pinball loss function. The uncertainty prediction, together with not relying on eye or face detectors, allows the model to estimate the gaze direction robustly even when the eyes are fully occluded.
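A minimal sketch of such a quantile-regression (pinball) loss is shown below; it conveys the idea of learning an uncertainty term alongside the gaze angles, but the quantile levels and exact form are assumptions rather than the precise Gaze360 formulation.

```python
import torch

def gaze_pinball_loss(pred_angles, pred_sigma, target_angles, tau=0.1):
    """Quantile (pinball) loss applied to the lower/upper quantile estimates
    pred_angles -/+ pred_sigma, so that pred_sigma learns to behave as an
    uncertainty estimate for the gaze prediction."""
    err_low = target_angles - (pred_angles - pred_sigma)   # lower quantile (level tau)
    err_high = target_angles - (pred_angles + pred_sigma)  # upper quantile (level 1 - tau)
    loss_low = torch.maximum(tau * err_low, (tau - 1.0) * err_low)
    loss_high = torch.maximum((1.0 - tau) * err_high, -tau * err_high)
    return (loss_low + loss_high).mean()

# Yaw/pitch predictions for a batch of 4 frames, with one shared sigma per sample.
loss = gaze_pinball_loss(torch.zeros(4, 2), torch.full((4, 1), 0.1), torch.rand(4, 2))
print(loss)
```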
Mean angular errors (MAE) have been calculated for various static and temporal models to validate the gaze estimation, and the correlation between the actual error and the predicted uncertainty has been measured using Spearman's rank correlation. MAEs of 13.5°, 11.4°, and 11.1° were obtained for the all-360°, front-180°, and front-facing scenarios, respectively, and an uncertainty correlation of 0.45 was obtained using the proposed method.

Figure 12: Representation of the Gaze360 model architecture by Kellnhofer et al. (2019).

Kodama et al. (2018) have proposed a method for localizing the common gaze target focused on by a crowd of people in a tennis stadium from low-resolution images by aggregating the individually estimated 3D gaze of each person, as shown in Figure 13. This study has further analyzed the relationship between the number of people involved in the aggregation and the localization accuracy of the common gaze target estimation. They have constructed a dataset of 12,792 images of 96 participants in a tennis stadium using two cameras, with 48 people in each image. The dataset further contains 454,739 face images annotated with 3D real-world coordinates, with yaw and pitch angles ranging from -74.02° to 74.02° and from -20.09° to -3.01°, respectively.

Figure 13: Representation of the model architecture by Kodama et al. (2018).

The authors have used a multi-task cascaded CNN-based face detector to detect the faces, which were then used to train the LeNet-5 (LeCun, Bottou, Bengio & Haffner, 1998) based gaze angle estimator. In the experimental study, the method's performance has been evaluated with respect to the number of people involved in the aggregation, considering the single-person case as the baseline. They achieved a 13.99 m mean absolute error of the estimated gaze point for the baseline and reduced it to 10.39 m by aggregating 24 people. Their comprehensive experimental study indicates promise for aggregating individual gaze estimations for more accurate common gaze target prediction in the wild. However, a more robust aggregation method still needs to be developed for cases where individual gaze estimations contain significant biases.

An application-specific method for 3D multi-user gaze estimation in the wild has been explored by Bermejo et al. (2020). They have proposed EyeShopper, an approach to analyze customer behavior in retail stores using gaze estimation from back-head images of shoppers, as shown in Figure 14. Due to the unavailability of public back-head datasets in the wild, they have further generated a synthetic back-head image dataset of 144,000 images of 50 subjects with ±90° head yaw and pitch variations. In this work, they have assumed that the customer's gaze can be predicted from the customer's head pose when the subject's face is not visible. With this assumption, they have proposed an accurate DCNN-based architecture for gaze estimation using the head pose from back-head images and a novel loss function. A fine-tuned You Only Look Once (YOLO) v3 model has been used as the back-head detector, and a hybrid coarse-fine approach with a static ResNet-18 backbone has been used as the head pose estimator. The coarse-fine approach combines a four-class head-pose classification layer and a fine regression layer implemented using fully connected layers.
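A hypothetical sketch of such a coarse-fine head is given below; the layer sizes, bin semantics, and names are illustrative assumptions rather than the EyeShopper implementation.

```python
import torch
import torch.nn as nn

class CoarseFineHead(nn.Module):
    """Coarse-fine prediction head: a coarse classifier over a few head-pose bins
    plus a fine angle regressor, both fed by the same pooled backbone features."""
    def __init__(self, feat_dim=512, n_bins=4):
        super().__init__()
        self.coarse = nn.Linear(feat_dim, n_bins)  # coarse head-pose class (e.g. quadrant)
        self.fine = nn.Linear(feat_dim, 2)         # fine yaw and pitch regression

    def forward(self, features):
        return self.coarse(features), self.fine(features)

# features would be the pooled ResNet-18 output, e.g. of shape (batch, 512).
coarse_logits, yaw_pitch = CoarseFineHead()(torch.randn(8, 512))
```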
The proposed model has been trained with 122,092 images and validated on 26,184 images by combining images from the UcoHead dataset (Muñoz-Salinas, Yeguas-Bolivar, Saffiotti & Medina-Carnicer, 2012), a manually labeled dataset, and the synthetic dataset. For back-head gaze estimation, a mean absolute error of 19°, which is 10% lower than Hopenet (Ruiz, Chong & Rehg, 2018), has been achieved, along with an average throughput of 0.52 frames per second (FPS).

Figure 14: Representation of the EyeShopper system architecture by Bermejo et al. (2020).

9. Discussion
9.1. Criteria for Selecting a Research Approach
We provide our suggestions for selecting a deep learning-based gaze estimation approach from a practical point of view, as shown in Figure 15. The criteria are based on the performance metrics and implementation issues reported in the gaze estimation literature. By considering the majority vote of the surveyed papers, we assumed that deep learning-based gaze estimation approaches are employed in unconstrained settings while model-based methods are used in constrained situations. The effectiveness of the aforementioned techniques depends on the environment settings, head angle, distance variation, subject count, available computational resources and other constraints. Additionally, our selection criteria are confined to methodologies based on deep learning and do not take into account the availability of datasets for decision making. These guidelines can be used as an advisory for practitioners and should not be considered a rigid criterion.

Figure 15: Guide to select a deep learning-based gaze estimation approach.

9.2. Open Challenges and Future Research Directions
The existing appearance-based gaze estimation methods can be broadly divided into single-user gaze estimation and multi-user gaze estimation. Multi-user gaze estimation has not received considerable attention in the literature. With the adoption of deep learning-based techniques in this domain, most studies have progressed into gaze estimation in real-world scenarios with unconstrained settings over the last decade. Through this adoption, the field has been confronted with numerous challenges and future opportunities. Achieving real-time inference speeds for multi-user gaze estimation has not yet been explored and remains a significant challenge in the field. The application-specific approach of Bermejo et al. (2020) for estimating shoppers' gaze in retail has reported an average of 0.52 FPS for the task. Moreover, a generalized deep learning model for multi-user gaze estimation in unconstrained settings has not been explored and remains a challenge. This generalized model should not be restricted to a specific application, environmental constraints, or a given number of users. From another point of view, eye gaze estimation solutions can be integrated with related systems such as human trackers (Gamage, Sudasingha, Perera & Meedeniya, 2018) and face detection (Meedeniya & Ratnaweera, 2007) to provide a complete product in a low-cost environment, where the models can be deployed on edge devices (Shashirangana et al., 2021). Also, a standard framework can be developed for the performance evaluation of eye gaze systems (Kar & Corcoran, 2017). Furthermore, the success of deep learning algorithms is due to the availability of large-scale datasets and computational resources.
In this field, the requirement for a large-scale generalized dataset remains a substantial challenge. The currently available GazeFollow dataset is limited to 2D multi-user gaze annotations in image coordinates. These challenges are the basis for future research. Thus, the future directions can be summarized as follows.

• DCNN-based approaches for multi-user gaze estimation have only been explored to a small extent in the literature. Current work has considered a few application domains such as the retail industry and crowd behavior analysis (Kodama et al., 2018; Tomas et al., 2021; Bermejo et al., 2020). Therefore, future research can consider applying multi-user gaze estimation in different application domains.

• Most CNN-based techniques reviewed in this paper for multi-user gaze estimation have not focused on the throughput of the approach. Future work can focus on acquiring a trade-off between the accuracy and the inference rate to produce a viable solution in unconstrained environments.

• Although multiple datasets exist for single-user gaze estimation, a standard publicly available dataset for multi-user gaze estimation remains a limitation. Future work can produce a large-scale generalized multi-user gaze dataset considering different head poses, illumination conditions, facial and head occlusions, subject variations, and target variations.

• In the multi-user gaze estimation literature, a standard performance evaluation framework has not been observed. Hence, future work can develop a framework for performance evaluation in multi-user gaze estimation considering unconstrained environments, target variations, subject variations, accuracy, and inference rates.

This survey presents an in-depth overview of deep learning-based gaze estimation techniques, focusing on multi-user gaze estimation in real-world conditions and highlighting their advantages and limitations. Furthermore, we provide a critical analysis of the related models and describe the available datasets, coordinate systems, and performance evaluation metrics and standards, together with the challenges and future opportunities in the field. Although only a few studies have been conducted in the specific field of multi-user gaze estimation, our study describes the state-of-the-art research with a comprehensive benchmark to encourage more work in this field. We believe that this field holds high potential, given the demand for gaze estimation applications in real-world settings. Finally, this survey can be used as a guideline for deep learning-based gaze estimation research.

10. Conclusion
Eye gaze estimation solutions are beneficial to many application domains, including commercial, social and medical health domains. This survey mainly explored the state-of-the-art approaches used in eye gaze research, focusing on deep learning techniques. This study critically analysed the related models in appearance-based gaze estimation approaches using deep learning techniques. In comparison to model-based methods and conventional appearance-based methods, appearance-based methods with deep learning perform robustly in unconstrained environment settings such as extreme head-pose variations, illumination conditions, and eye and face occlusions. Furthermore, they can learn a complex non-linear mapping function directly from image data to gaze without the requirement of a dedicated device.
It was observed that single-user gaze estimation approaches have been broadly studied in constrained and unconstrained environments, achieving near-human performance. However, multi-user gaze estimation has been explored only in a few application domains, such as retail and crowd-behaviour analysis. Moreover, we have presented the strengths and challenges of the related techniques and the features of publicly available datasets. Finally, we have provided suggestions for selecting eye gaze estimation approaches and discussed possible future research directions, which can be beneficial for researchers and developers in the field.

References
Akinyelu, A. A., & Blignaut, P. (2020). Convolutional Neural Network-Based Methods for Eye Gaze Estimation: A Survey. IEEE Access, 8, 142581–142605. doi:10.1109/ACCESS.2020.3013540.
Bermejo, C., Chatzopoulos, D., & Hui, P. (2020). EyeShopper: Estimating Shoppers' Gaze using CCTV Cameras. MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (pp. 2765–2774). doi:10.1145/3394171.3413683.
Cazzato, D., Leo, M., Distante, C., & Voos, H. (2020). When I look into your eyes: A survey on computer vision contributions for human gaze estimation and tracking. Sensors (Switzerland), 20, 1–42. doi:10.3390/s20133739.
Chao, P., Kao, C.-Y., Ruan, Y.-S., Huang, C.-H., & Lin, Y.-L. (2019). Hardnet: A low memory traffic network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3552–3561).
Cheng, Y., Wang, H., Bao, Y., & Lu, F. (2021). Appearance-based Gaze Estimation With Deep Learning: A Review and Benchmark (pp. 1–21). URL: http://arxiv.org/abs/2104.12668. arXiv:2104.12668.
Chennamma, H. R., & Yuan, X. (2013). A Survey on Eye-Gaze Tracking Techniques. 4, 388–393. URL: http://arxiv.org/abs/1312.6410. arXiv:1312.6410.
Chong, E., Ruiz, N., Wang, Y., Zhang, Y., Rozga, A., & Rehg, J. M. (2018). Connecting Gaze, Scene, and Attention: Generalized Attention Estimation via Joint Modeling of Gaze and Scene Saliency. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 11209 LNCS, 397–412. doi:10.1007/978-3-030-01228-1_24. arXiv:1807.10437.
Chong, E., Wang, Y., Ruiz, N., & Rehg, J. M. (2020). Detecting attended visual targets in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5396–5406).
De Silva, S., Dayarathna, S., Ariyarathne, G., Meedeniya, D., Jayarathna, S., & Michalek, A. (2021). Computational Decision Support System for ADHD Identification. International Journal of Automation and Computing (IJAC), 18, 233–255. doi:10.1007/s11633-020-1252-1.
De Silva, S., Dayarathna, S., Ariyarathne, G., Meedeniya, D., Jayarathna, S., Michalek, A., & Jayawardena, G. (2019). A rule-based system for ADHD identification using eye movement data. In Proceedings of the Moratuwa Engineering Research Conference (MERCon) (pp. 538–543). doi:10.1109/MERCon.2019.8818865.
Everingham, M., Gool, L. V., Williams, C. K. I., Winn, J., & Zisserman, A. (2009). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88, 303–338.
Fang, Y., Tang, J., Shen, W., Shen, W., Gu, X., Song, L., & Zhai, G. (2021). Dual attention guided gaze target detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11390–11399).
Fischer, T., Chang, H. J., & Demiris, Y. (2018). Rt-gene: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 334–352).
Gamage, G., Sudasingha, I., Perera, I., & Meedeniya, D. (2018). Reinstating dlib correlation human trackers under occlusions in human detection based tracking. In Proceedings of the 18th International Conference on Advances in ICT for Emerging Regions (ICTer) (pp. 92–98). doi:10.1109/ICTER.2018.8615551.
Ghani, M. U., Chaudhry, S., Sohail, M., & Geelani, M. N. (2013). Gazepointer: A real time mouse pointer control implementation based on eye gaze tracking. In Proceedings of the 16th International Multi-Topic Conference (pp. 154–159). IEEE.
Goldberg, J. H., & Kotval, X. P. (1999). Computer interface evaluation using eye movements: methods and constructs. International Journal of Industrial Ergonomics, 24, 631–645.
Graves, A., Fernández, S., & Schmidhuber, J. (2005). Bidirectional lstm networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks (pp. 799–804). Springer.
Guojun, Y., & Saniie, J. (2016). Eye tracking using monocular camera for gaze estimation applications. IEEE International Conference on Electro Information Technology, 2016-August, 292–296. doi:10.1109/EIT.2016.7535254.
Gwon, S. Y., Cho, C. W., Lee, H. C., Lee, W. O., & Park, K. R. (2013). Robust eye and pupil detection method for gaze tracking. International Journal of Advanced Robotic Systems, 10, 98.
Handelman, G. S., Kok, H. K., Chandra, R. V., Razavi, A. H., Huang, S., Brooks, M., Lee, M. J., & Asadi, H. (2019). Peering into the black box of artificial intelligence: evaluation metrics of machine learning methods. American Journal of Roentgenology, 212, 38–43.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
Huang, Q., Veeraraghavan, A., & Sabharwal, A. (2017). Tabletgaze: dataset and analysis for unconstrained appearance-based gaze estimation in mobile tablets. Machine Vision and Applications, 28, 445–461.
Ji, Q., Zhu, Z., & Lan, P. (2004). Real-time nonintrusive monitoring and prediction of driver fatigue. IEEE Transactions on Vehicular Technology, 53, 1052–1068.
Judd, T., Ehinger, K., Durand, F., & Torralba, A. (2009). Learning to predict where humans look. In 2009 IEEE 12th International Conference on Computer Vision (pp. 2106–2113). IEEE.
Kacete, A., Séguier, R., Collobert, M., & Royan, J. (2016). Unconstrained gaze estimation using random forest regression voting. In Asian Conference on Computer Vision (pp. 419–432). Springer.
Kar, A., & Corcoran, P. (2017). A review and analysis of eye-gaze estimation systems, algorithms and performance evaluation methods in consumer platforms. IEEE Access, 5, 16495–16519. doi:10.1109/ACCESS.2017.2735633.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8110–8119).
Kasprowski, P., & Harężlak, K. (2014). Cheap and easy pin entering using eye gaze. Annales Universitatis Mariae Curie-Sklodowska, sectio AI–Informatica, 14, 75–84.
Kellnhofer, P., Recasens, A., Stent, S., Matusik, W., & Torralba, A. (2019). Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6912–6921).
Kerr-Gaffney, J., Harrison, A., & Tchanturia, K. (2019). Eye-tracking research in eating disorders: A systematic review. International Journal of Eating Disorders, 52, 3–27.
Khan, M. Q., & Lee, S. (2019). Gaze and eye tracking: techniques and applications in adas. Sensors, 19, 5540.
Kim, J., Stengel, M., Majercik, A., De Mello, S., Dunn, D., Laine, S., McGuire, M., & Luebke, D. (2019). Nvgaze: An anatomically-informed dataset for low-latency, near-eye gaze estimation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–12). New York, NY, USA: Association for Computing Machinery. URL: https://doi.org/10.1145/3290605.3300780.
Kodama, Y., Kawanishi, Y., Hirayama, T., Deguchi, D., Ide, I., Murase, H., Nagano, H., & Kashino, K. (2018). Localizing the gaze target of a crowd of people. In Proceedings of the 14th Asian Conference on Computer Vision (pp. 15–30). Springer.
Koenker, R. (2005). Quantile regression: economic society monograph series. New York: Cambridge University Press.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Commun. ACM, 60, 84–90. URL: https://doi.org/10.1145/3065386. doi:10.1145/3065386.
Kumar, M., Garfinkel, T., Boneh, D., & Winograd, T. (2007a). Reducing shoulder-surfing by using gaze-based password entry. In Proceedings of the 3rd Symposium on Usable Privacy and Security (pp. 13–19).
Kumar, M., Paepcke, A., & Winograd, T. (2007b). Eyepoint: practical pointing and selection using gaze and keyboard. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 421–430).
Kwon, Y.-M., Jeon, K.-W., Ki, J., Shahab, Q. M., Jo, S., & Kim, S.-K. (2006). 3D Gaze Estimation and Interaction to Stereo Display. International Journal of Virtual Reality, 5, 41–45. doi:10.20870/ijvr.2006.5.3.2697.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
Lee, E. C., Ko, Y. J., & Park, K. R. (2009). Gaze tracking based on active appearance model and multiple support vector regression on mobile devices. Optical Engineering, 48, 077002.
Lee, H. C., Luong, D. T., Cho, C. W., Lee, E. C., & Park, K. R. (2010). Gaze tracking system at a distance for controlling iptv. IEEE Transactions on Consumer Electronics, 56, 2577–2583.
Lee, S.-H., Lee, J.-Y., & Choi, J.-S. (2011). Design and implementation of an interactive hmd for wearable ar system. In 2011 17th Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV) (pp. 1–6). IEEE.
Lian, D., Yu, Z., & Gao, S. (2018). Believe It or Not, We Know What You Are Looking At! In Proceedings of the 14th Asian Conference on Computer Vision (pp. 35–50). Springer, volume 11363 LNCS. doi:10.1007/978-3-030-20893-6_3.
Lian, D., Zhang, Z., Luo, W., Hu, L., Wu, M., Li, Z., Yu, J., & Gao, S. (2019). Rgbd based gaze estimation via multi-task cnn. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 2488–2495). volume 33.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117–2125).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision (pp. 740–755). Springer.
Liu, D., Dong, B., Gao, X., & Wang, H. (2015). Exploiting eye tracking for smartphone authentication. In International Conference on Applied Cryptography and Network Security (pp. 457–477). Springer.
Lu, F., Okabe, T., Sugano, Y., & Sato, Y. (2014a). Learning gaze biases with head motion for head pose-free gaze estimation. Image Vision Comput., 32, 169–179. doi:10.1016/j.imavis.2014.01.005.
Lu, F., Sugano, Y., Okabe, T., & Sato, Y. (2014b). Adaptive linear regression for appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 2033–2046.
Mahanama, B., Jayawardana, Y., & Jayarathna, S. (2020). Gaze-Net: Appearance-based gaze estimation using capsule networks. In Proceedings of the 11th Augmented Human International Conference (pp. 18–21). doi:10.1145/3396339.3396393.
Meedeniya, D., & Ratnaweera, A. (2007). Enhanced face recognition through variation of principle component analysis (PCA). In Proceedings of International Conference on Industrial and Information Systems (ICIIS) (pp. 347–352). Peradeniya, Sri Lanka. doi:10.1109/iciinfs.2007.4579200.
Mishra, A., & Lin, H.-T. (2020). 360-Degree Gaze Estimation in the Wild Using Multiple Zoom Scales. arXiv preprint arXiv:2009.06924.
Mora, K. A. F., Monay, F., & Odobez, J. (2014). Eyediap: a database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications (pp. 255–258). doi:10.1145/2578153.2578190.
Morimoto, C. H., & Mimica, M. R. (2005). Eye gaze tracking techniques for interactive applications. Computer Vision and Image Understanding, 98, 4–24.
Muñoz-Salinas, R., Yeguas-Bolivar, E., Saffiotti, A., & Medina-Carnicer, R. (2012). Multi-camera head pose estimation. Machine Vision and Applications, 23, 479–490.
Oved, D., Alvarado, I., & Gallo, A. (2018). Real-time human pose estimation in the browser with tensorflow. TensorFlow Medium. URL: https://blog.tensorflow.org/2018/05/real-time-human-pose-estimation-in.html.
Park, H. S., Jain, E., & Sheikh, Y. (2012). 3D social saliency from head-mounted cameras. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (pp. 422–430).
Park, H. S., & Shi, J. (2015). Social saliency prediction. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 07-12-June, 4777–4785. doi:10.1109/CVPR.2015.7299110.
Piumsomboon, T., Lee, G., Lindeman, R. W., & Billinghurst, M. (2017). Exploring natural eye-gaze-based interaction for immersive virtual reality. In 2017 IEEE Symposium on 3D User Interfaces (3DUI) (pp. 36–39). IEEE.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2020.3019967.
Raptis, G. E., Katsini, C., Belk, M., Fidas, C., Samaras, G., & Avouris, N. (2017). Using eye gaze data and visual activities to infer human cognitive styles: method and feasibility studies. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization (pp. 164–173).
Recasens, A. R. C. (2016). Where are they looking? Ph.D. thesis, Massachusetts Institute of Technology.
Ruiz, N., Chong, E., & Rehg, J. M. (2018). Fine-grained head pose estimation without keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 2074–2083).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211–252. doi:10.1007/s11263-015-0816-y.
Saad, A., Elkafrawy, D. H., Abdennadher, S., & Schneegass, S. (2020). Are they actually looking? identifying smartphones shoulder surfing through gaze estimation. In ACM Symposium on Eye Tracking Research and Applications (pp. 1–3).
Shashirangana, J., Padmasiri, H., Meedeniya, D., Perera, C., Nayak, S. R., Nayak, J., Vimal, S., & Kadry, S. (2021). License Plate Recognition Using Neural Architecture Search for Edge Devices. International Journal of Intelligent Systems, (pp. 1–38). doi:10.1002/int.22471.
Sibert, L. E., & Jacob, R. J. (2000). Evaluation of eye gaze interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 281–288).
Sidorakis, N., Koulieris, G. A., & Mania, K. (2015). Binocular eye-tracking for the control of a 3d immersive multimedia user interface. In 2015 IEEE 1st Workshop on Everyday Virtual Reality (WEVR) (pp. 15–18). IEEE.
Smith, B. A., Yin, Q., Feiner, S. K., & Nayar, S. K. (2013). Gaze locking: passive eye contact detection for human-object interaction. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology (pp. 271–280).
Špakov, O., & Miniotas, D. (2005). Gaze-based selection of standard-size menu items. In Proceedings of the 7th International Conference on Multimodal Interfaces (pp. 124–128).
Sugano, Y., Matsushita, Y., & Sato, Y. (2014). Learning-by-synthesis for appearance-based 3d gaze estimation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1821–1828).
Sugano, Y., Zhang, X., & Bulling, A. (2016). AggreGaze: Collective estimation of audience attention on public displays. UIST 2016 - Proceedings of the 29th Annual Symposium on User Interface Software and Technology, (pp. 821–831). doi:10.1145/2984511.2984536.
Sun, W., Sun, N., Guo, B., Jia, W., & Sun, M. (2016). An auxiliary gaze point estimation method based on facial normal. Pattern Analysis and Applications, 19, 611–620.
Sun, X., Xu, L., & Yang, J. (2007). Driver fatigue alarm based on eye detection and gaze estimation. In MIPPR 2007: Automatic Target Recognition and Image Analysis; and Multispectral Image Acquisition (p. 678612). International Society for Optics and Photonics, volume 6786.
Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., & Nießner, M. (2018). Facevr: Real-time facial reenactment and eye gaze control in virtual reality. ACM Transactions on Graphics, 37. doi:10.1145/3182644.
Tomas, H., Reyes, M., Dionido, R., Ty, M., Mirando, J., Casimiro, J., Atienza, R., & Guinto, R. (2021). Goo: A dataset for gaze object prediction in retail environments. arXiv:2105.10793.
Tsukada, A., Shino, M., Devyver, M., & Kanade, T. (2011). Illumination-free gaze estimation method for first-person vision wearable device. Proceedings of the IEEE International Conference on Computer Vision, (pp. 2084–2091). doi:10.1109/ICCVW.2011.6130505.
Velichkovsky, B. B., Rumyantsev, M. A., & Morozov, M. A. (2014). New solution to the midas touch problem: Identification of visual commands via extraction of focal fixations. Procedia Computer Science, 39, 75–82.
Wang, H., Dong, X., Chen, Z., & Shi, B. E. (2015). Hybrid gaze/eeg brain computer interface for robot arm control on a pick and place task. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 1476–1479). IEEE.
Wang, H., Pi, J., Qin, T., Shen, S., & Shi, B. E. (2018a). Slam-based localization of 3d gaze using a mobile eye tracker. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications (pp. 1–5).
Wang, J., & Olson, E. (2016). Apriltag 2: Efficient and robust fiducial detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 4193–4198).
Wang, W., & Shen, J. (2017). Deep visual attention prediction. IEEE Transactions on Image Processing, 27, 2368–2378.
Wang, Y., Zhao, T., Ding, X., Peng, J., Bian, J., & Fu, X. (2018b). Learning a gaze estimator with neighbor selection from large-scale synthetic eye images. Know.-Based Syst., 139, 41–49. URL: https://doi.org/10.1016/j.knosys.2017.10.010. doi:10.1016/j.knosys.2017.10.010.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (pp. 3485–3492).
Xu, P., Ehinger, K. A., Zhang, Y., Finkelstein, A., Kulkarni, S. R., & Xiao, J. (2015). Turkergaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755.
Yao, B., Jiang, X., Khosla, A., Lin, A., Guibas, L., & Fei-Fei, L. (2011). Human action recognition by learning bases of action attributes and parts. 2011 International Conference on Computer Vision, (pp. 1331–1338).
Young, L. R., & Sheena, D. (1975). Survey of eye movement recording methods. Behavior Research Methods & Instrumentation, 7, 397–429.
Yu, L., Xu, J., & Huang, S. (2016). Eye-gaze tracking system based on particle swarm optimization and bp neural network. In 2016 12th World Congress on Intelligent Control and Automation (WCICA) (pp. 1269–1273). doi:10.1109/WCICA.2016.7578296.
Zhai, S., Morimoto, C., & Ihde, S. (1999). Manual and gaze input cascaded (magic) pointing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 246–253).
Zhang, X., Park, S., Beeler, T., Bradley, D., Tang, S., & Hilliges, O. (2020). Eth-xgaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In Proceedings of the 16th European Conference on Computer Vision (pp. 365–381). doi:10.1007/978-3-030-58558-7_22.
Zhang, X., Sugano, Y., & Bulling, A. (2019a). Evaluation of appearance-based methods and implications for gaze-based applications. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–13).
Zhang, X., Sugano, Y., Fritz, M., & Bulling, A. (2015). Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4511–4520).
Zhang, X., Sugano, Y., Fritz, M., & Bulling, A. (2017). It's written all over your face: Full-face appearance-based gaze estimation. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (pp. 2299–2308).
Zhang, X., Sugano, Y., Fritz, M., & Bulling, A. (2019b). Mpiigaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41, 162–175.
Zheng, R., Nakano, K., Ishiko, H., Hagita, K., Kihira, M., & Yokozeki, T. (2015). Eye-gaze tracking analysis of driver behavior while interacting with navigation systems in an urban area. IEEE Transactions on Human-Machine Systems, 46, 546–556.
Zhou, B., Lapedriza, À., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Proceedings of the 27th Advances in Neural Information Processing Systems (pp. 487–495).