All warfare is based on deception. --Sun Tzu
Data breaches and cybersecurity have been in the news a great deal, but manipulation is another threat to your security. In my first blog post ever, I discuss the technology behind deep fakes and false news, and how this new branch of data science can bias your thoughts and actions and threaten your security. I will cover four basic technologies used to create deep fakes.
The first step in protecting yourself in today’s technology-laden media world is to know what is possible… and truly the sky is the limit. Although I encourage you to discover the wealth of articles out there, let’s just start with the basics for now…
Screen shot taken from Face2Face: Real-time Face Capture and Reenactment of RGB Videos (CVPR 2016 Oral) (https://www.youtube.com/watch?v=ohmajJTcpNk)
This video accompanies a paper by five researchers from the University of Erlangen-Nuremberg, the Max Planck Institute for Informatics and Stanford University. The original whitepaper (http://niessnerlab.org/papers/2016/1facetoface/thies2016face.pdf) lays out the mathematics behind the technique and shows that it runs on readily available computing power. It was, and is, quite an impressive feat: using a standard off-the-shelf webcam, the system captures a person's facial expressions and applies them to a face in an existing video stream in real time, producing a new and convincing video.
Taken from Face2Face: Real-time Face Capture and Reenactment of RGB Videos white paper (http://niessnerlab.org/papers/2016/1facetoface/thies2016face.pdf)
Screen shot taken from [ICCV 2019] FSGAN: Subject Agnostic Face Swapping and Reenactment (https://www.youtube.com/watch?v=BsITEVX6hkE)
Another technology was presented in the whitepaper FSGAN: Subject Agnostic Face Swapping and Reenactment (https://arxiv.org/pdf/1908.05932.pdf) by three researchers, Yuval Nirkin, Yosi Keller and Tal Hassner, at the International Conference on Computer Vision in 2019. This advancement does not require training on the specific faces involved for the system to work. In the authors' words, “the method eliminates laborious, subjects specific, data collection and model training, making face swapping and reenactment accessible to non-experts”1. The results can be generated much more quickly and with almost imperceptible changes to the destination video stream.
At the time the system was developed, the process required sophisticated neural networks that were not readily available to the public. As the technology has progressed, these kinds of networks have become widely available, and similar face swaps can now be done on your smartphone.
Animation from Speech
The next technology was presented in the research paper “Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion” (https://research.nvidia.com/publication/2017-07_Audio-Driven-Facial-Animation) by five authors: Tero Karras, Timo Aila, Samuli Laine, Antti Herva (Remedy Entertainment), and Jaakko Lehtinen (NVIDIA and Aalto University). It was originally designed for video games… “expressive facial animation is an essential part of modern computer-generated movies and digital games”2. The system uses a complex neural network to analyze and process the audio stream so that the character's facial motion is mapped correctly. The key point about this technology is that the audio drives the digital performance. Animators are no longer required to “paint” the facial movements to match the spoken dialog. Now, we are free to just recite text, and the computer will apply the words, cadence, and emotional performance to a digital character.
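To make the idea of audio driving the animation concrete, here is a deliberately simplified sketch. It is not the neural network from the paper: it is a toy stand-in that assumes loudness alone controls a single, hypothetical "mouth open" value per frame. The function names, the frame size, and the synthetic test signal are all my own assumptions for illustration.

```python
import math

def frame_rms(samples, frame_size):
    """Split audio samples into fixed-size frames and return each frame's RMS loudness."""
    rms_values = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms_values.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return rms_values

def mouth_openness(samples, frame_size, max_rms=1.0):
    """Map per-frame loudness to a hypothetical 'mouth open' value between 0 and 1."""
    return [min(rms / max_rms, 1.0) for rms in frame_rms(samples, frame_size)]

# Synthetic "audio" (an assumption, not real speech): silence, then a 220 Hz tone.
sample_rate = 8000
silence = [0.0] * 800
tone = [0.8 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(800)]

# One value per 400-sample frame: the mouth stays closed during silence
# and opens once the tone starts.
curve = mouth_openness(silence + tone, frame_size=400)
print([round(v, 2) for v in curve])
```

A real system like the one in the paper learns a far richer mapping, covering the whole face and the emotional tone, but the direction of control is the same: the sound goes in and the facial motion comes out.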
Screen shot taken from This AI Clones Your Voice After Listening for 5 Seconds! (https://www.youtube.com/watch?v=0sR1rU3gLzQ&t=101s)
The technology in this video is described in “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis” (https://arxiv.org/abs/1806.04558) by eleven authors: Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, and Yonghui Wu. The method described is quite impressive. The program is “able to synthesize natural speech from speakers that were not seen during training”3. This shows that the system is able to generate consonants and sounds not spoken in the original sample. From a short five-second clip, the algorithm can generate the pitch and inflections to match the original sample. To me, the legitimate use case for this type of technology is less clear. The perverse use is much more apparent, but I will let the reader or listener be the judge.
The idea that what you're reading, seeing, and hearing can all be modified and manipulated needs to be seriously considered when seeking or gathering information. It is one thing to use the technology discussed here in the context of Hollywood or for entertainment purposes, but when these tools and techniques are used for nefarious reasons, they can have serious consequences.
The takeaway from all of this is that anyone can now do it. What does this truly mean? The entertainment industry can fix movie overdubbing or create new performances for existing footage, but dishonest parties can just as easily falsify news streams or videos. Be wary: don’t believe everything you see and hear…
1. Page 8, “FSGAN: Subject Agnostic Face Swapping and Reenactment”; Y. Nirkin (Bar-Ilan University, Israel), Y. Keller (Bar-Ilan University, Israel), T. Hassner (The Open University of Israel, Israel)
2. Page 1, “Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion”; T. Karras (NVIDIA Inc.), T. Aila (NVIDIA Inc.), S. Laine (NVIDIA Inc.), A. Herva (Remedy Entertainment), J. Lehtinen (NVIDIA Inc. and Aalto University)
3. See Cornell University, Computer Science: Computation and Language, https://arxiv.org/abs/1806.04558, for the excerpt.