Research Topics
Photo-realistic synthesis and manipulation of images and videos of human faces have received substantial attention lately, thanks to the impressive results that recent methods have achieved. These techniques have a plethora of applications, such as movie post-production, actor dubbing in movies, performance capture for visual effects, video-conferencing, video games, photo-realistic affective avatars, virtual assistants, digital art installations and psychology research, to name but a few. The problems involved are particularly challenging and lie at the cutting edge of today's research and technology.
In this field, my collaborators and I have developed novel deep learning methods for the photo-realistic manipulation of the emotional state of actors in videos under ``in-the-wild'' conditions. We have also developed novel head reenactment systems that fully transfer the head pose, facial expression and eye movements from a source video to a target identity in a completely photo-realistic and faithful manner. To achieve our state-of-the-art results, we leverage our previous 3D face modelling and reconstruction methods (see below), as well as the latest advances in Generative Adversarial Networks (GANs) for image synthesis and domain translation. Our methods build on parametric 3D face representations of the actor, combined with novel pipelines for neural face rendering.
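At a high level, neural face rendering of this kind can be trained with a conditional adversarial objective. The following is a generic, illustrative formulation (not the exact losses of our papers), where c is the rendering of the parametric 3D face that conditions the generator G, x is the corresponding real frame, and D is the discriminator:

```latex
\mathcal{L}(G, D) =
  \mathbb{E}_{x}\big[\log D(x)\big]
  + \mathbb{E}_{c}\big[\log\big(1 - D(G(c))\big)\big]
  + \lambda \, \mathbb{E}_{(c,x)}\big[\lVert G(c) - x \rVert_{1}\big]
```

The generator minimises this objective while the discriminator maximises it; the weighted L1 reconstruction term keeps the synthesised frames anchored to the target footage.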
Note on social impact: As described above, deep learning systems for facial video synthesis and manipulation like ours can have a positive impact on society and the daily life of humans. At the same time, however, this type of technology risks being misused to produce harmful manipulated videos of individuals (e.g. celebrities or politicians) without their consent. This raises concerns related to the creation and distribution of fake news and other forms of negative social impact. We strongly believe that scientists and engineers working in these fields need to be aware of these risks and ethical issues and take them seriously. We support and contribute to the development of countermeasures, such as raising public awareness about the capabilities of current technology through talks, articles and demos for the general public; developing deep learning systems that detect deepfake videos; and releasing source code under ethical licenses.
Demo videos:
Related publications:
Popular science article about our research:
The human face is one of the most extensively studied objects in Computer Vision and Computer Graphics. Modelling and reconstructing the detailed 3D shape, appearance and dynamics of the human face has numerous applications, such as augmented reality, performance capture, computer games, visual effects, human-computer interaction, computer-aided craniofacial surgery and facial expression recognition, to name a few.
My collaborators and I have created the so-called Large Scale Facial Model (LSFM) - a 3D Morphable Model (3DMM) of human faces automatically constructed from around 10,000 distinct facial identities. To the best of our knowledge, LSFM is the largest-scale Morphable Model ever constructed. It contains statistical information from a very wide variety of the human population, which allowed us to construct not only a global 3DMM but also bespoke models tailored to specific age, gender or ethnicity groups. To build such a large-scale model, we developed a novel, fully automated and robust Morphable Model construction pipeline.
More recently, we have expanded our large-scale modelling to the dynamics of human faces by introducing MimicMe, a novel large-scale database of dynamic high-resolution 3D faces. MimicMe contains recordings of 4,700 subjects with great diversity in age, gender and ethnicity. The recordings are in the form of 4D videos of subjects displaying a multitude of facial behaviours, resulting in over 280,000 3D meshes in total. Based on this database, we have built powerful blendshape models for parameterising facial behaviour.
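At its core, a linear Morphable Model of this kind is built by applying Principal Component Analysis to a set of densely registered 3D scans. The sketch below is a minimal, illustrative version of that step (not our full construction pipeline, which also handles registration and robustness to corrupted scans):

```python
import numpy as np

def build_morphable_model(meshes, n_components):
    """Build a linear (PCA) morphable model from registered 3D meshes.

    meshes: (N, 3V) array -- N scans, each flattened as x1,y1,z1,...
    Returns the mean shape, the principal components and the
    per-component standard deviations.
    """
    mean = meshes.mean(axis=0)
    X = meshes - mean
    # SVD of the centred data yields the principal components (rows of Vt)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    stddev = S[:n_components] / np.sqrt(len(meshes) - 1)
    return mean, Vt[:n_components], stddev

def synthesise(mean, components, stddev, alpha):
    """Generate a new face as mean + sum_i alpha_i * sigma_i * c_i."""
    return mean + (alpha * stddev) @ components
```

Setting all coefficients alpha to zero recovers the mean face; sampling them from a standard normal produces plausible novel identities within the span of the training population.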
Demo videos:
Related publications:
Popular science article about our research:
My collaborators and I have introduced methods that can accurately and robustly fit morphable face models like our LSFM to a very wide range of real-world images and videos, taken by even a simple commodity camera and capturing the face of a human moving, talking and making expressions. More recently, we have also developed deep learning frameworks based on Convolutional Neural Networks (CNNs) that regress the 3D facial mesh from RGB input. These are trained on very large-scale datasets (more than 10,000 videos with more than 2,000 unique identities), relying on pseudo-annotations produced by our robust traditional model-fitting approaches.
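The supervision signal in such a setup can be illustrated with a simple per-vertex loss computed through a linear 3DMM. This is a minimal sketch, not our exact training loss; `mean` and `components` stand for a statistical shape model like LSFM:

```python
import numpy as np

def vertex_loss(pred_params, pseudo_params, mean, components):
    """Mean per-vertex Euclidean distance between the mesh decoded from
    the CNN's predicted 3DMM parameters and the mesh decoded from the
    pseudo-annotation produced by traditional model fitting."""
    v_pred = (mean + pred_params @ components).reshape(-1, 3)
    v_gt = (mean + pseudo_params @ components).reshape(-1, 3)
    return float(np.linalg.norm(v_pred - v_gt, axis=1).mean())
```

Measuring the error in vertex space (rather than directly on the parameters) weights each parameter by its actual geometric effect on the face.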
Demo videos:
Related publications:
We are very interested in applying 3D face modelling to human emotion recognition, an increasingly popular line of research that plays a central role in the field of Affective Computing. Analysing facial expressions is one of the most important non-intrusive ways to infer human emotions. Solving this problem successfully is immensely beneficial for a plethora of applications, e.g. intelligent human-computer interaction, stress analysis for medical research, robotic assistants for children with autism, interactive computer games and emotion transfer.
Towards that goal, we recently constructed a large-scale dataset of facial videos (more than 10,000 YouTube videos of people talking), rich in facial dynamics, identities, expressions, appearance and 3D pose variations. We used this dataset to train a deep CNN that estimates the expression parameters of a 3D Morphable Model, and combined it with an effective back-end emotion classifier. Our proposed framework runs at 50 frames per second and is capable of robustly estimating the parameters of 3D expression variation and accurately recognising facial expressions from in-the-wild images. In addition, we have been studying the adoption of similar deep learning frameworks for the closely related problem of analysing levels of stress from facial videos.
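The back-end stage of such a pipeline can be as simple as a classifier over the estimated expression parameters. The sketch below is a hypothetical linear-softmax stand-in (the weights `W`, `b` and the emotion label set are illustrative assumptions, not our trained model):

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "surprise", "fear", "disgust", "anger"]

def classify_expression(expr_params, W, b):
    """Map a frame's 3DMM expression parameters to emotion probabilities
    with a linear layer followed by a softmax. W and b stand in for a
    trained classifier; the real back-end can be any lightweight model."""
    logits = expr_params @ W + b
    z = np.exp(logits - logits.max())   # numerically stable softmax
    probs = z / z.sum()
    return EMOTIONS[int(np.argmax(probs))], probs
```

Because the expression parameters are a compact, identity-normalised representation, even a lightweight back-end like this can run at video frame rates.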
Demo videos:
Related publications:
Around 70 million Deaf people worldwide use Sign Languages (SLs) as their native languages. At the same time, they often have limited reading/writing skills in the spoken language. This puts them at a severe disadvantage in many contexts, including education, work, and usage of computers and the Internet. Sign Language technologies can support the Deaf in many ways, e.g. by enabling the development of systems for Human-Computer Interaction in Sign Language and for translation between sign and spoken language.
My collaborators and I are very interested in developing pioneering methodologies for Automatic Sign Language Recognition, Sign Language Translation, as well as photo-realistic Sign Language Synthesis. Towards these goals, we have introduced new methods and datasets for continuous sign language recognition and have been studying the importance of non-manual signals (eye gaze, facial expressions, body motions, etc) in sign language recognition and synthesis.
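While our exact recognition pipelines differ, a standard ingredient in continuous sign language recognition is CTC-style decoding, which collapses a network's per-frame predictions into a gloss sequence. A minimal sketch of the greedy decoding rule (merge repeats, then drop blanks):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse per-frame label predictions into a gloss sequence using
    the standard CTC rule: merge consecutive repeats, then remove the
    blank symbol. A blank between two identical labels keeps them
    as two separate glosses."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out
```

For example, the frame-wise sequence `[0, 1, 1, 0, 2, 2, 3]` decodes to the three glosses `[1, 2, 3]`, while `[1, 1, 0, 1]` decodes to `[1, 1]` because the blank separates two instances of the same sign.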
Demo videos:
Related publications:
The coastal seafloor of the Southeast Mediterranean Sea supports an increasing number of human development activities and, at the same time, faces various threats such as sea-level rise, coastal erosion and the introduction of invasive species due to ocean warming. Adequate marine planning therefore requires advanced methods for mapping the shallow seafloor efficiently and in detail.
In collaboration with remote sensing scientists from IMS-FORTH, we are developing an integrated methodology for shallow bathymetry retrieval and detailed mapping of the coastal benthic cover of the Cretan shoreline, whose transparent waters are well suited to studies with optical imagery. We are developing novel methods for seafloor mapping from multi-temporal, multi-spectral imagery acquired by unmanned aerial vehicles (UAVs). In addition, we use measurements from unmanned surface vehicles (USVs) as training data. We are studying the application of state-of-the-art image analysis, neural networks and machine learning techniques to train a system for automated, high-resolution 3D reconstruction of the coastal seafloor.
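One widely used empirical baseline for depth retrieval in optically shallow water is the band log-ratio model of Stumpf et al. (2003). The sketch below is an illustration of how such a model can be calibrated against USV depth soundings, not necessarily the method we deploy:

```python
import numpy as np

def fit_log_ratio_bathymetry(blue, green, usv_depth, n=1000.0):
    """Fit the log-ratio bathymetry model
        depth ~ m1 * ln(n * R_blue) / ln(n * R_green) + m0
    by least squares, calibrated on co-located USV depth soundings.
    blue/green are water-leaving reflectances in the two bands."""
    x = np.log(n * blue) / np.log(n * green)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (m1, m0), *_ = np.linalg.lstsq(A, usv_depth, rcond=None)
    return m1, m0

def predict_depth(blue, green, m1, m0, n=1000.0):
    """Apply the calibrated model to map depth over the whole scene."""
    return m1 * np.log(n * blue) / np.log(n * green) + m0
```

The appeal of the log-ratio form is that the ratio of the two attenuated bands varies with depth while being comparatively insensitive to the benthic cover, so a handful of USV soundings suffices for calibration.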
Related publications:
For more details, please visit the website of ACTYS project.
Optical flow is a fundamental problem of Computer Vision that seeks to estimate per-pixel displacements between two images of the same scene. Optical flow in the presence of non-rigid deformations is a challenging task and an important problem that continues to attract significant attention from the Computer Vision community. It can play a significant role in a wide variety of problems, such as medical imaging, dense non-rigid 3D reconstruction, dense 3D mesh registration, motion segmentation, video re-texturing, super-resolution, facial expression recognition, facial tracking, facial animation and reenactment.
We are interested in generalizations of the traditional optical flow problem, for example by using a large number of images (frames) over long sequences, by incorporating prior knowledge about the observed objects (e.g. faces), and/or by estimating the 3D motion field from pairs of monocular images. Towards that goal, we have developed novel algorithms that adopt dense variational formulations, model-based and low-rank priors, as well as CNN-based frameworks, to solve the relevant problems robustly and accurately.
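The data term underlying most flow formulations is the linearised brightness-constancy equation Ix*u + Iy*v + It = 0. As a toy illustration (a single-window least-squares solver in the style of Lucas-Kanade, not one of our methods, which add dense variational regularisation and priors):

```python
import numpy as np

def lucas_kanade_window(I0, I1):
    """Estimate one (u, v) displacement for a small image window by
    solving the over-determined system Ix*u + Iy*v = -It in the
    least-squares sense (linearised brightness constancy)."""
    I0 = I0.astype(float)
    I1 = I1.astype(float)
    Iy, Ix = np.gradient(I0)          # gradients along rows (y) and cols (x)
    It = I1 - I0                      # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

This only recovers a single constant displacement per window and fails under the aperture problem; dense variational formulations couple this data term with smoothness or subspace priors to obtain a full, regularised flow field.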
Demo videos:
Related publications:
My collaborators and I have developed novel methodologies for dense dynamic 3D reconstruction of non-rigid scenes. We introduced the first variational approach to the problem of dense 3D reconstruction of non-rigid surfaces from a monocular video sequence (oral at CVPR 2013). Beyond that, we developed a robust and accurate energy-minimization method for non-rigid video registration (multi-frame optical flow) that provides dense estimation of long-term 2D trajectories, using subspace constraints and convex optimization methods (IJCV 2013). In addition, we proposed the first algorithm in the literature to solve the problem of simultaneous motion segmentation, motion estimation and dense 3D reconstruction from videos taken with a single hand-held camera capturing multiple independently moving objects (oral at ISMAR 2012). Note that our aforementioned methodologies are generic and require only the video from a single camera as input, without markers, additional sensors or prior knowledge about the type of object(s) in the scene.
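The subspace constraint mentioned above rests on the observation that, for many non-rigid scenes, long-term 2D trajectories stacked into a measurement matrix lie near a low-dimensional subspace. A minimal sketch of that low-rank prior via truncated SVD (an illustration of the idea, not our full registration energy):

```python
import numpy as np

def low_rank_trajectories(W, rank):
    """Project a 2F x P measurement matrix of point tracks (x and y rows
    per frame, one column per long-term trajectory) onto its best
    rank-r approximation via truncated SVD. By the Eckart-Young theorem
    this is the closest rank-r matrix in the Frobenius norm."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]
```

In a full registration energy, this projection acts as a denoising prior: trajectories that drift or lose track are pulled back towards the motion subspace spanned by the reliable tracks.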
Demo videos:
Related publications: