Research Topics
Photo-realistic synthesis and manipulation of images and videos of human faces have received substantial attention lately, thanks to the impressive results that recent methods have achieved. These techniques have a plethora of applications, such as movie post-production, actor dubbing in movies, performance capture for visual effects, video-conferencing, video games, photo-realistic affective avatars, virtual assistants, digital art installations and psychology research, to name but a few. The problems involved are particularly challenging and lie at the cutting edge of today's research and technology.
In this field, my collaborators and I have developed novel deep learning methods for the photo-realistic manipulation of the emotional state of actors in videos under ``in-the-wild'' conditions. We have also developed novel head reenactment systems that fully transfer the head pose, facial expression and eye movements from a source video to a target identity in a completely photo-realistic and faithful manner. To achieve our state-of-the-art results, we leverage our previous 3D face modelling and reconstruction methods (see below), as well as the latest advances in Generative Adversarial Networks (GANs) for image synthesis and domain translation. Our methods build on parametric 3D face representations of the actor, combined with novel pipelines for neural face rendering.
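At a high level, neural face rendering of this kind can be trained with a conditional adversarial objective. The following is a generic, illustrative formulation (not the exact losses of our papers), where c is the rendering of the parametric 3D face that conditions the generator G, x is the corresponding real frame, and D is the discriminator:

```latex
\mathcal{L}(G, D) =
  \mathbb{E}_{x}\big[\log D(x)\big]
  + \mathbb{E}_{c}\big[\log\big(1 - D(G(c))\big)\big]
  + \lambda \, \mathbb{E}_{(c,x)}\big[\lVert G(c) - x \rVert_{1}\big]
```

The generator minimises this objective while the discriminator maximises it; the weighted L1 reconstruction term keeps the synthesised frames anchored to the target footage.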
Note on social impact: As described above, deep learning systems for facial video synthesis and manipulation like ours can have a positive impact on society and the daily life of humans. At the same time, however, this type of technology risks being misused to produce harmful manipulated videos of individuals (e.g. celebrities or politicians) without their consent. This raises concerns related to the creation and distribution of fake news and other forms of negative social impact. We strongly believe that scientists and engineers working in these fields need to be aware of these risks and ethical issues and take them seriously. We support and contribute to the development of countermeasures, such as raising public awareness about the capabilities of current technology through talks, articles and demos for the general public; developing deep learning systems that detect deepfake videos; and releasing source code under ethical licenses.
Demo videos:
Related publications:
Popular science article about our research:
The human face is one of the most extensively studied objects in Computer Vision and Computer Graphics. Modelling and reconstructing the detailed 3D shape, appearance and dynamics of the human face has numerous applications, such as augmented reality, performance capture, computer games, visual effects, human-computer interaction, computer-aided craniofacial surgery and facial expression recognition, to name a few.
My collaborators and I have created the so-called Large Scale Facial Model (LSFM) - a 3D Morphable Model (3DMM) of human faces automatically constructed from around 10,000 distinct facial identities. To the best of our knowledge, LSFM is the largest-scale Morphable Model ever constructed. It contains statistical information from a very wide variety of the human population, which allowed us to construct not only a global 3DMM but also bespoke models tailored to specific age, gender or ethnicity groups. To build such a large-scale model, we developed a novel, fully automated and robust Morphable Model construction pipeline.
More recently, we have expanded our large-scale modelling to the dynamics of human faces by introducing MimicMe, a novel large-scale database of dynamic high-resolution 3D faces. MimicMe contains recordings of 4,700 subjects with great diversity in age, gender and ethnicity. The recordings are in the form of 4D videos of subjects displaying a multitude of facial behaviours, resulting in over 280,000 3D meshes in total. Based on this database, we have built powerful blendshape models for parameterising facial behaviour.
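At its core, a linear Morphable Model of this kind is built by applying Principal Component Analysis to a set of densely registered 3D scans. The sketch below is a minimal, illustrative version of that step (not our full construction pipeline, which also handles registration and robustness to corrupted scans):

```python
import numpy as np

def build_morphable_model(meshes, n_components):
    """Build a linear (PCA) morphable model from registered 3D meshes.

    meshes: (N, 3V) array -- N scans, each flattened as x1,y1,z1,...
    Returns the mean shape, the principal components and the
    per-component standard deviations.
    """
    mean = meshes.mean(axis=0)
    X = meshes - mean
    # SVD of the centred data yields the principal components (rows of Vt)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    stddev = S[:n_components] / np.sqrt(len(meshes) - 1)
    return mean, Vt[:n_components], stddev

def synthesise(mean, components, stddev, alpha):
    """Generate a new face as mean + sum_i alpha_i * sigma_i * c_i."""
    return mean + (alpha * stddev) @ components
```

Setting all coefficients alpha to zero recovers the mean face; sampling them from a standard normal produces plausible novel identities within the span of the training population.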
Demo videos:
Related publications:
Popular science article about our research:
My collaborators and I have introduced methods that can accurately and robustly fit morphable face models like our LSFM to a very wide range of real-world images and videos, taken by even a simple commodity camera and capturing the face of a human moving, talking and making expressions. More recently, we have also developed deep learning frameworks based on Convolutional Neural Networks (CNNs) that regress the 3D facial mesh from RGB input. These are trained on very large-scale datasets (more than 10,000 videos with more than 2,000 unique identities), relying on pseudo-annotations produced by our robust traditional model-fitting approaches.
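The supervision signal in such a setup can be illustrated with a simple per-vertex loss computed through a linear 3DMM. This is a minimal sketch, not our exact training loss; `mean` and `components` stand for a statistical shape model like LSFM:

```python
import numpy as np

def vertex_loss(pred_params, pseudo_params, mean, components):
    """Mean per-vertex Euclidean distance between the mesh decoded from
    the CNN's predicted 3DMM parameters and the mesh decoded from the
    pseudo-annotation produced by traditional model fitting."""
    v_pred = (mean + pred_params @ components).reshape(-1, 3)
    v_gt = (mean + pseudo_params @ components).reshape(-1, 3)
    return float(np.linalg.norm(v_pred - v_gt, axis=1).mean())
```

Measuring the error in vertex space (rather than directly on the parameters) weights each parameter by its actual geometric effect on the face.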
Demo videos:
Related publications:
We are very interested in applying 3D face modelling to human emotion recognition, an increasingly popular line of research that plays a central role in the field of Affective Computing. Analysing facial expressions is one of the most important non-intrusive ways to infer human emotions. Solving this problem successfully is immensely beneficial for a plethora of applications, e.g. intelligent human-computer interaction, stress analysis for medical research, robotic assistants for children with autism, interactive computer games and emotion transfer.
Towards that goal, we recently constructed a large-scale dataset of facial videos (more than 10,000 YouTube videos of people talking), rich in facial dynamics, identities, expressions, appearance and 3D pose variations. We used this dataset to train a deep CNN that estimates the expression parameters of a 3D Morphable Model, and combined it with an effective back-end emotion classifier. Our proposed framework runs at 50 frames per second and is capable of robustly estimating the parameters of 3D expression variation and accurately recognising facial expressions from in-the-wild images. In addition, we have been studying the adoption of similar deep learning frameworks for the closely related problem of analysing levels of stress from facial videos.
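The back-end stage of such a pipeline can be as simple as a classifier over the estimated expression parameters. The sketch below is a hypothetical linear-softmax stand-in (the weights `W`, `b` and the emotion label set are illustrative assumptions, not our trained model):

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "surprise", "fear", "disgust", "anger"]

def classify_expression(expr_params, W, b):
    """Map a frame's 3DMM expression parameters to emotion probabilities
    with a linear layer followed by a softmax. W and b stand in for a
    trained classifier; the real back-end can be any lightweight model."""
    logits = expr_params @ W + b
    z = np.exp(logits - logits.max())   # numerically stable softmax
    probs = z / z.sum()
    return EMOTIONS[int(np.argmax(probs))], probs
```

Because the expression parameters are a compact, identity-normalised representation, even a lightweight back-end like this can run at video frame rates.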
Demo videos:
Related publications:
Around 70 million Deaf people worldwide use Sign Languages (SLs) as their native languages. At the same time, they often have limited reading/writing skills in the spoken language. This puts them at a severe disadvantage in many contexts, including education, work, and usage of computers and the Internet. Sign Language technologies can support the Deaf in many ways, e.g. by enabling the development of systems for Human-Computer Interaction in Sign Language and for translation between sign and spoken language.
My collaborators and I are very interested in developing pioneering methodologies for Automatic Sign Language Recognition, Sign Language Translation, as well as photo-realistic Sign Language Synthesis. Towards these goals, we have introduced new methods and datasets for continuous sign language recognition and have been studying the importance of non-manual signals (eye gaze, facial expressions, body motions, etc) in sign language recognition and synthesis.
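While our exact recognition pipelines differ, a standard ingredient in continuous sign language recognition is CTC-style decoding, which collapses a network's per-frame predictions into a gloss sequence. A minimal sketch of the greedy decoding rule (merge repeats, then drop blanks):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse per-frame label predictions into a gloss sequence using
    the standard CTC rule: merge consecutive repeats, then remove the
    blank symbol. A blank between two identical labels keeps them
    as two separate glosses."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out
```

For example, the frame-wise sequence `[0, 1, 1, 0, 2, 2, 3]` decodes to the three glosses `[1, 2, 3]`, while `[1, 1, 0, 1]` decodes to `[1, 1]` because the blank separates two instances of the same sign.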
Demo videos:
Related publications:
The coastal seafloor of the Southeast Mediterranean Sea supports an increasing number of human development activities and, at the same time, faces various threats such as sea-level rise, coastal erosion and the introduction of invasive species due to ocean warming. Adequate marine planning therefore requires advanced methods for mapping the shallow seafloor efficiently and in detail.
In collaboration with remote sensing scientists from IMS-FORTH, we are developing an integrated methodology for shallow bathymetry retrieval and detailed mapping of the coastal benthic cover of the Cretan shoreline, whose transparent waters are well suited to studies with optical imagery. We are developing novel methods for seafloor mapping from multi-temporal, multi-spectral imagery acquired by unmanned aerial vehicles (UAVs). In addition, we use measurements from unmanned surface vehicles (USVs) as training data. We are studying the application of state-of-the-art image analysis, neural networks and machine learning techniques to train a system for automated, high-resolution 3D reconstruction of the coastal seafloor.
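One widely used empirical baseline for depth retrieval in optically shallow water is the band log-ratio model of Stumpf et al. (2003). The sketch below is an illustration of how such a model can be calibrated against USV depth soundings, not necessarily the method we deploy:

```python
import numpy as np

def fit_log_ratio_bathymetry(blue, green, usv_depth, n=1000.0):
    """Fit the log-ratio bathymetry model
        depth ~ m1 * ln(n * R_blue) / ln(n * R_green) + m0
    by least squares, calibrated on co-located USV depth soundings.
    blue/green are water-leaving reflectances in the two bands."""
    x = np.log(n * blue) / np.log(n * green)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (m1, m0), *_ = np.linalg.lstsq(A, usv_depth, rcond=None)
    return m1, m0

def predict_depth(blue, green, m1, m0, n=1000.0):
    """Apply the calibrated model to map depth over the whole scene."""
    return m1 * np.log(n * blue) / np.log(n * green) + m0
```

The appeal of the log-ratio form is that the ratio of the two attenuated bands varies with depth while being comparatively insensitive to the benthic cover, so a handful of USV soundings suffices for calibration.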
Related publications:
For more details, please visit the website of ACTYS project.
Optical flow is a fundamental problem of Computer Vision that seeks to estimate per-pixel displacements between two images of the same scene. Optical flow in the presence of non-rigid deformations is a challenging task and an important problem that continues to attract significant attention from the Computer Vision community. It can play a significant role in a wide variety of problems, such as medical imaging, dense non-rigid 3D reconstruction, dense 3D mesh registration, motion segmentation, video re-texturing, super-resolution, facial expression recognition, facial tracking, facial animation and reenactment.
We are interested in generalizations of the traditional optical flow problem, for example by using a large number of images (frames) over long sequences, by incorporating prior knowledge about the observed objects (e.g. faces), and/or by estimating the 3D motion field from pairs of monocular images. Towards that goal, we have developed novel algorithms that adopt dense variational formulations, model-based and low-rank priors, as well as CNN-based frameworks, to solve the relevant problems robustly and accurately.
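The data term underlying most flow formulations is the linearised brightness-constancy equation Ix*u + Iy*v + It = 0. As a toy illustration (a single-window least-squares solver in the style of Lucas-Kanade, not one of our methods, which add dense variational regularisation and priors):

```python
import numpy as np

def lucas_kanade_window(I0, I1):
    """Estimate one (u, v) displacement for a small image window by
    solving the over-determined system Ix*u + Iy*v = -It in the
    least-squares sense (linearised brightness constancy)."""
    I0 = I0.astype(float)
    I1 = I1.astype(float)
    Iy, Ix = np.gradient(I0)          # gradients along rows (y) and cols (x)
    It = I1 - I0                      # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

This only recovers a single constant displacement per window and fails under the aperture problem; dense variational formulations couple this data term with smoothness or subspace priors to obtain a full, regularised flow field.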
Demo videos:
Related publications:
My collaborators and I have developed novel methodologies for dense dynamic 3D reconstruction of non-rigid scenes. We introduced the first variational approach to the problem of dense 3D reconstruction of non-rigid surfaces from a monocular video sequence (oral at CVPR 2013). Beyond that, we developed a robust and accurate energy-minimization method for non-rigid video registration (multi-frame optical flow) that provides dense estimation of long-term 2D trajectories, using subspace constraints and convex optimization methods (IJCV 2013). In addition, we proposed the first algorithm in the literature to solve the problem of simultaneous motion segmentation, motion estimation and dense 3D reconstruction from videos taken with a single hand-held camera capturing multiple independently moving objects (oral at ISMAR 2012). Note that our aforementioned methodologies are generic and require only the video from a single camera as input, without markers, additional sensors or prior knowledge about the type of object(s) in the scene.
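The subspace constraint mentioned above rests on the observation that, for many non-rigid scenes, long-term 2D trajectories stacked into a measurement matrix lie near a low-dimensional subspace. A minimal sketch of that low-rank prior via truncated SVD (an illustration of the idea, not our full registration energy):

```python
import numpy as np

def low_rank_trajectories(W, rank):
    """Project a 2F x P measurement matrix of point tracks (x and y rows
    per frame, one column per long-term trajectory) onto its best
    rank-r approximation via truncated SVD. By the Eckart-Young theorem
    this is the closest rank-r matrix in the Frobenius norm."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]
```

In a full registration energy, this projection acts as a denoising prior: trajectories that drift or lose track are pulled back towards the motion subspace spanned by the reliable tracks.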
Demo videos:
Related publications: