Gesthimani Roumpani
REAL FICTIONS
UCLA AUD IDEAS ENTERTAINMENT STUDIO 2019-20 Research Seminar with Natasha Sandmeier
CONTENTS
Chapter i - This Person Is Just a StyleGAN
    To GAN or to StyleGAN
    Thispersondoesnotexist.com
    Fake-spotting
    Observations comparison
    (In lieu of) Conclusion
    References
Chapter ii - Toward a Hyperfake-ism
    Ephemeral faces
    Few-shot deepfakes
    What comes next?
    Photoreal or hyperreal?
    The human factor
    References
Chapter iii - Regarding Vision; biological & digital
    Vision of humans
    Vision of animals
    ...and ants
    Computers & dimensions
    The architecture of Convolutional Neural Networks
    Conclusion
    References
THIS PERSON IS JUST A STYLEGAN Gesthimani Roumpani
One of 2019's most successful websites [1], thispersondoesnotexist.com, visualizes the latest advancement in Generative Adversarial Network (GAN) technology [2]. The bold move to use the most familiar medium, the human face, to communicate this technology's capabilities is an attempt to educate the general public on how far GANs can go, and to raise awareness about media literacy and the evaluation of information sources [3]. Using this website (https://thispersondoesnotexist.com/) as a tool, and considering the speed with which the neural network keeps improving its results, we examine clues to distinguish fiction from reality in an experiential way that is still pertinent to the current results.

To GAN or to StyleGAN

Generative Adversarial Networks (GANs) first appeared on arXiv in June 2014 as a Machine Learning concept [4], describing the idea of simultaneously training two models pitted against each other: a generative model G and a discriminative model D. In a nutshell, model G learns from a pool of data to generate new content, and model D evaluates whether a given piece of content came from the original data pool or was generated by G. Successful results are those which D cannot identify as constructed by G [5]. For more details on GANs and how they work, take a look at the chapter "Introduction and Theory of GAN" by Ruohan Yang (2019).

To serve the purpose of the current chapter, we will take a look at a specific type of GAN called StyleGAN, through a preprint first posted by NVIDIA researchers in December 2018 and revised in March 2019 [6]. Tero Karras, Samuli Laine, and Timo Aila (2018) explain how "intuitive, scale-specific control of the synthesis" (p. 1) can be achieved via separation of high-level attributes and stochastic variation. Noting that GANs still struggled to generate convincing results despite rapid improvements, they redesigned the generator to operate with more control over the gradual and final synthesis. It starts with a very low resolution and coarse characteristics, such as pose and face shape, continues to a middle-level resolution and further facial features such as hair style and eyes, and concludes with the finer resolution that affects color schemes and detailed characteristics. As a result, the output is controlled more effectively at each scale than with the original GAN methods [7].

GAN-generated faces, 2014 (Figure 2)
GAN-generated faces, 2017 (Figure 3)
StyleGAN-generated faces, 2018 (Figure 4)

Thispersondoesnotexist.com

In February 2019, NVIDIA released their open-source code to the public [8]. While for the non-tech population this was no tangible news, for software engineer Philip Wang it was an opportunity to help the general public visualize the capabilities of GAN [9]. After all, code on GitHub in its raw form is only useful to those who can read it. To achieve that, he created the website thispersondoesnotexist.com, where the face of a non-existent stranger is generated every time the page gets refreshed. Even though engineers have experimented with other StyleGAN-generated results from that same source code [10], the blunt choice of the human face to depict the neural network's capabilities is deliberate. Considering the effortlessness with which Wang's website generates convincing results that can sometimes deceive human perception, the website serves the author's purpose to "raise public awareness for this technology. Faces are most salient to our cognition, so I've decided to put that specific pretrained model up" (Philip Wang, 2019) [11].

HAVE YOU SEEN THIS PERSON?*
*No. Because she does not exist.

Fake-spotting

Before StyleGAN, distinguishing a computer-generated face was an easy task due to peculiar characteristics beyond the GAN's control. With StyleGAN the results improved rapidly. While the technology is developing and, by definition, a self-training algorithm can only get better, there are still effective techniques to tell a fabricated face from a real one.

Another website called "Which Face Is Real" (http://whichfaceisreal.com/index.php) challenges its visitors to play a game: see two photos of different people and select the real one. Inspired by thispersondoesnotexist.com, the creators Jevin West and Carl Bergstrom (2019) aim to take the awareness effort one step further by training the human eye to spot the fake pictures at a glance. As they notably mention, "[...] it may be only a few years until humans fall behind in the arms race between forgery and detection". However, they do point out that our visual perception is still better than the computer's at spotting fakes "at least for the time being" [12].
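The tug-of-war between G and D described under "To GAN or to StyleGAN" can be made concrete with a small numeric sketch. The Python below is an illustration only, not code from the GAN paper or from NVIDIA's repository; the function names and the sample scores are hypothetical. It assumes that D outputs, for each image, a probability that the image is real, and it computes the usual cross-entropy losses that each model would try to minimize.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def discriminator_loss(d_real, d_fake):
    """Loss for D: low when real images score near 1 and fakes score near 0.

    d_real: D's "probability of being real" for images from the data pool.
    d_fake: D's "probability of being real" for images produced by G.
    """
    real_term = sum(math.log(max(p, EPS)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, EPS)) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake):
    """Loss for G: low when D is fooled into scoring fakes near 1."""
    return -sum(math.log(max(p, EPS)) for p in d_fake) / len(d_fake)


# Hypothetical scores: an early stage where D easily spots the fakes...
confident_d = discriminator_loss(d_real=[0.90, 0.95], d_fake=[0.10, 0.05])
# ...and a late stage where D can only guess (0.5 for everything).
fooled_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print("D loss when fakes are obvious:", round(confident_d, 3))
print("D loss when D can only guess: ", round(fooled_d, 3))
print("G loss when fakes are obvious:", round(generator_loss([0.10, 0.05]), 3))
print("G loss when D can only guess: ", round(generator_loss([0.5, 0.5]), 3))
```

When D answers 0.5 for every image, its loss settles at 2 ln 2 (about 1.386), which corresponds to the situation the text calls a successful result: D can no longer tell G's output from the original data.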
There are online examples [13, 14] in the form of blog posts or articles that have been evolving alongside the GAN technology and identifying the weak spots that serve as fake giveaways. Considering that GAN improvement speed can outrun any effort to document flaws, for the purpose of this chapter we have generated 100 images from our case study website and then identified peculiar details that reveal the true nature of these portraits. It is in no way implied that 100 images is a sufficient amount to reach definitive conclusions about the capability of the technology. This experiential methodology is rather employed as a way to compare personal findings to those of other individuals who have attempted something similar in the past. Please note that this method was last tested on November 17, 2019. Any advancements since then might have already rendered some of the following observations obsolete.

Background (Figures 6 & 7) - Faces generated with an abstract, blurry, or unidentifiable background are more convincing than those that attempt to represent crisper details. More specifically, two different effects were identified: the "Monet" effect, and the UVW texture map effect. The former is named as such due to the brush stroke-like effect that resembles Claude Monet's paintings. The latter looks like a UVW texture map generated by a 3D scanned object, for those familiar with the process.

Teeth (Figure 7) - Another common mishap appears to be the teeth, when the face allows for them to show. Besides the jagged edges or occasional striking discolorations, there seems to be a pattern of failing to replicate the natural placement and types of teeth (e.g. more incisors in place of canines).

Glasses (Figure 8) - When it comes to eyeglasses, which in reality cause a distortion effect to the objects behind them, StyleGAN is not always successful. Images that have no distortion effect behind the glass (often the case for conditions like low astigmatism) achieve a more realistic result. In terms of the eyeglass frame, the results are impressive but not flawless, as asymmetry is sometimes observed.

Ears (Figure 9) - This is probably the part of the face where the most asymmetry occurs. Additionally, ears are sometimes mismatched with the age group of the face (e.g. a child with an adult's ears).

Earrings (Figure 10) - Though accessories are a weak spot in general, the most notable is earrings. Not because of the asymmetry (as this could happen in real life), but because of unidentifiable shapes that often occur as blotches.

Age spots (Figure 11) - For older faces there seems to be a weakness in correctly generating fine lines, wrinkles, and other inconspicuous age spots.
YOUR GUIDE TO FAKE-SPOTTING [October 2019] (Figures 6-17)
Secondary characters (Figure 12) - In every picture where a second person is implied next to the main character, it is not recognized as a face but rather as an object, resulting in peculiar visual results.

Color bleed & blotches (Figure 13) - Bright colors do not appear very convincing, as they often bleed into neighboring objects. Moreover, blotches located at places where they could potentially represent lens glare are generated in a way that gives the image away.

Clothes (Figure 14) - Though usually not a focal point, sometimes the asymmetry and unrealistic texture of the clothes reveal the true nature of the picture.

Single hairs (Figure 15) - When imitating the effect of single hairs or small hair strands surrounding the face, hair and skin are blended in an unconvincing way.

Hair accessories (Figure 16) - Hats, hoods, and beanies often appear blended with the hair. This is better observed with dark colored accessories that could be confused with a hair color, rather than bright ones.

Observations comparison

Drawing from Kyle McDonald's article on medium.com [15], last updated with the release of NVIDIA's arXiv paper on StyleGAN [16], the major issues observed then, though persistent, are not as problematic now. More specifically, hair and overall face asymmetry have improved dramatically toward becoming realistic.

(In lieu of) Conclusion

After the brief explanation of StyleGAN and experiential observations on generated results, it is important to point out that currently the system produces these images once and then they are forever lost - unless someone saves them. For speculations on how this technology could advance, more findings are examined in the next chapter, in an attempt to predict the future of GAN in generating human faces.
"Edmond de Belamy, from La Famille de Belamy", an AI-created painting that was auctioned for $432,500 (Figure 18).
References

[1] Paez, D. (2019, February 13). This person does not exist is the best one-off website of 2019. Retrieved from https://www.inverse.com/article/53280-this-person-does-not-exist-ganswebsite
[2] Karras, T., Laine, S., & Aila, T. (2019, March 29). A style-based generator architecture for Generative Adversarial Networks. ArXiv e-prints. Retrieved from https://arxiv.org/pdf/1812.04948.pdf
[3] Paez, D. (2019, February 21). "This person does not exist" creator reveals his site's creepy origin story. Retrieved from https://www.inverse.com/article/53414-this-person-does-not-exist-creator-interview
[4] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014, June 10). Generative Adversarial Nets. ArXiv e-prints. Retrieved from https://arxiv.org/pdf/1406.2661.pdf
[5] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014, June 10). Generative Adversarial Nets. ArXiv e-prints. Retrieved from https://arxiv.org/pdf/1406.2661.pdf
[6] Karras, T., Laine, S., & Aila, T. (2019, March 29). A style-based generator architecture for Generative Adversarial Networks. ArXiv e-prints. Retrieved from https://arxiv.org/pdf/1812.04948.pdf
[7] Karras, T., Laine, S., & Aila, T. (2019, March 29). A style-based generator architecture for Generative Adversarial Networks. ArXiv e-prints. Retrieved from https://arxiv.org/pdf/1812.04948.pdf
[8] Karras, T. (2019, February 8). StyleGAN - Official TensorFlow Implementation. GitHub. Retrieved from https://github.com/NVlabs/stylegan
[9] Paez, D. (2019, February 21). "This person does not exist" creator reveals his site's creepy origin story. Retrieved from https://www.inverse.com/article/53414-this-person-does-not-exist-creator-interview
[10] Khan, J. (2019, July 31). StyleGAN: Use machine learning to generate and customize realistic images. Retrieved from https://heartbeat.fritz.ai/stylegans-use-machine-learning-to-generate-and-customize-realistic-images-c943388dc672
[11] Golt, M. (2019, February 14). This Website uses AI to generate the faces of people who don't exist. Retrieved from https://www.vice.com/en_us/article/7xn4wy/this-website-uses-ai-togenerate-the-faces-of-people-who-dont-exist
[12] West, J., & Bergstrom, C. (2019). Which face is real? Seeing through the illusions of a fabricated world. Retrieved from http://whichfaceisreal.com/bout.html
[13] Mendis, A. (2019, April 3). Which face is real? Retrieved from https://www.kdnuggets.com/2019/04/which-face-real-stylegan.html
[14] McDonald, K. (2018, December 13). How to recognize fake AI-generated images. Retrieved from https://medium.com/@kcimc/how-to-recognize-fake-ai-generated-images-4d1f6f9a2842
[15] McDonald, K. (2018, December 13). How to recognize fake AI-generated images. Retrieved from https://medium.com/@kcimc/how-to-recognize-fake-ai-generated-images-4d1f6f9a2842
[16] Karras, T., Laine, S., & Aila, T. (2019, March 29). A style-based generator architecture for Generative Adversarial Networks. ArXiv e-prints. Retrieved from https://arxiv.org/pdf/1812.04948.pdf
Images

Figure 1 (p. 2-3 full spread background) Lukkarinen, S. (2019). Artificial 6 [painting]. Retrieved from https://www.picuki.com/media/2186276416370660485
Figure 2 Radford, A., Metz, L., & Chintala, S. (2016). More face generations from our Face DCGAN [digitally created by AI]. Retrieved from https://arxiv.org/pdf/1511.06434.pdf
Figure 3 Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2018). 1024 × 1024 images [digitally created by AI, using the CELEBA-HQ dataset]. Retrieved from https://arxiv.org/pdf/1710.10196.pdf
Figure 4 Karras, T., Laine, S., & Aila, T. (2019). Coarse styles from source B [digitally created by AI]. Retrieved from https://arxiv.org/pdf/1812.04948.pdf
Figure 5 (p. 6 background image) [Portrait generated by AI on thispersondoesnotexist.com]. (2019).
Figure 6-17 [Portraits generated by AI on thispersondoesnotexist.com]. (2019).
Figure 18 French art collective Obvious (2018). Edmond de Belamy, from La Famille de Belamy [Portrait generated by AI]. Retrieved from https://www.thebigsmoke.com.au/2018/10/27/ai-generated-artwork-sells-for-432000-critics-suspect-plagiarism/
TOWARD A HYPERFAKE-ISM
Gesthimani Roumpani
The temporal nature of GAN-generated faces makes any application ideas appear constrained to the restrictive nature of the single frame. A quick glance at a new technology [1] associated with deepfakes (digitally fabricated videos of people saying things they never said) reveals that the next step for GANs is close, and poses the question of what the next challenge is after they have overcome their current setbacks. Is hyperrealism (or rather, hyperfake-ism) a plausible answer?

Ephemeral faces

Despite any awareness that GAN faces are not actual people, it is still striking that, in the case of thispersondoesnotexist.com for instance, once the web page is refreshed the person stops existing in the digital world as well. Fascination with this lost data has given birth to online communities and forums that collect GAN faces (like the thread called TPDNE on reddit.com), which could be compared to a digital graveyard. This comparison possibly confirms the acceptance of the digital environment as something that coexists with physical reality and is often viewed within similar terms. The question, however, remains: what could we possibly do with a single frame generated by an AI system? A recent paper by Samsung researchers might have the answer.

Few-shot deepfakes [2]

Deepfakes have come to both amuse and shock the world with their results [3]. Up until recently, it was necessary to have multiple shots/frames of the person you wanted to rig in order to create a deepfake, but a paper first posted on arXiv in May 2019 and revised in September 2019 claims to be able to generate a talking head by using only a few frames [4].

Results using face landmark tracks from different videos to transform still images to talking heads (Figure 2). Panels: Source, Target → Landmarks → Result. Shown on the title page of "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models" by Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky (Samsung AI Center, Moscow; Skolkovo Institute of Science and Technology).

Result comparison by Samsung. The numbers on the left column indicate the number of training frames. "Ground truth" is an image taken from the actual video sequence, to compare against the computer-generated results. The rest of the columns represent different methods used to generate a result. Top: comparing Samsung's method to others' (Figure 3). Bottom: comparing different Samsung methods (Figure 4).

It is bluntly stated in the abstract that the goal for this system is to be able to generate a talking head with "potentially even a single image" (Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V., 2019). The opening line of the paper references GAN-generated, highly realistic images, which hints at the intent of using ephemeral faces such as those from thispersondoesnotexist.com.

The paper explains the use of "face landmarks" (Figure 2) driving the animation of the face, playing the role of rigging components used in animation software. The "few-shot learning" method is significantly faster than other talking head generating systems, as it is based on extensive pre-training on existing talking head videos.

The same way GANs are used to generate single images of a random face, they are hereby employed to generate realistic poses of a given head. Subsequently, the original frame and the newly generated poses of that same, fake head can be used to create a realistic talking head video. While results are compared to other methods and appear significantly improved (Figures 3 & 4), they still have a way to go before reaching fully realistic results. However, considering the accelerated advancements in AI, one should expect that obstacles will be overcome sooner or later.
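To picture how landmark tracks can act like a rig, the following Python sketch blends one set of 2D landmark points into another and rasterizes each intermediate set onto a small grid. It is a toy illustration under stated assumptions, not the method of Zakharov et al.: the three-point "mouth", the linear blending, and the function names are all hypothetical, and the trained generator that would turn each landmark map into a photographic frame is left out entirely.

```python
def interpolate_landmarks(start, end, steps):
    """Linearly blend two landmark sets into `steps` frames (endpoints included)."""
    if steps < 2:
        raise ValueError("need at least two frames")
    if len(start) != len(end):
        raise ValueError("both poses must have the same number of landmarks")
    frames = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append([
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(start, end)
        ])
    return frames


def rasterize(landmarks, size):
    """Mark each landmark on a size x size grid of 0s (rows are y, columns are x)."""
    grid = [[0] * size for _ in range(size)]
    for x, y in landmarks:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < size and 0 <= col < size:
            grid[row][col] = 1
    return grid


# Hypothetical three-point "mouth": two corners and the lower lip.
mouth_closed = [(2.0, 4.0), (6.0, 4.0), (4.0, 5.0)]
mouth_open = [(2.0, 4.0), (6.0, 4.0), (4.0, 7.0)]

frames = interpolate_landmarks(mouth_closed, mouth_open, steps=3)
for frame in frames:
    # In a talking-head system, a landmark map like this would condition the generator.
    for row in rasterize(frame, size=8):
        print("".join("#" if cell else "." for cell in row))
    print()
```

Each printed grid plays the part of one frame of the rig: the face identity would come from the source image, while the landmark map only dictates the pose and expression for that frame.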
What comes next?

If we consider the aforementioned paper a solution for finding an applicable use for the faces of thispersondoesnotexist.com, the possibilities for applications are open to the imagination. Setting aside sinister uses of the technology [5], as they would lead to an endless loop of pessimistic speculation, let us (for now) acknowledge the possibility of this new system working as a leap toward achieving a GAN realism that evades human perception. In that case, we can start making predictions about what the next step would be for those non-existent people. A possible answer? Hyperrealism.

TKTS Line, 2005, Richard Estes (Figure 5)

Photoreal or hyperreal?

Photorealism is an American art movement that started in the 1960s [6] as an art genre revolving around the accurate representation of a photograph or simulation of reality on a different medium [7]. Taking on the world of computer graphics, it is also used to describe computer-generated images that look real, or like a photograph (photorealistic renderings). At the point when photorealism becomes "too real" [8], we might start talking about hyperrealism.

Hyperrealism does not stop at carefully studying a photograph to produce an accurate representation. The artist employs ways to enhance the reality behind their subject with details that are otherwise inconspicuous, looking to provoke reactions and emotions from the audience, even when they face mundane objects [9]. In hyperrealism, the line between reality and fiction is sometimes blurred until the entire object is revealed with all its uncanny characteristics. As art historians Horst Bredekamp and Barbara Maria Stafford put it, Hyperrealism is defined by its "paradoxical existence" and "the revelation of the unexpected by magnification" [10] (2006).

Mask II, 2011, Ron Mueck (Figure 6)

Understanding the distinctions between the above is important in the attempt to categorize GAN-generated faces. The dilemma we are facing is the following: they are generated by a computer system, and do not exist in reality. We see those faces as pictures, and our gut reaction is that those are actual people who, past the single shot we are seeing, are able (or were at some point able) to move, talk, think, breathe. However, they are not; thus they are deceiving the audience in terms of their actual scale and use, like hyperrealism. Synced, a popular online source for AI and technology review / commentary, has used the term "hyperrealistic" to describe the faces generated after the release of NVIDIA's StyleGAN [11].

Looking at hyperrealistic art, however, it is clearly associated with high definition, stylization, and depiction of detail that seems logical but is not normally observed at a glance. Furthermore, those GAN faces are not intended to generate or propagate certain emotions; all they are trying to do is fool us.

Hence, for the purpose of this text we are considering GAN faces to belong within the realistic realm. And this is why we can speculate that, similarly to the evolution of art, the next step to be taken will be a hyperreal representation of human faces that can elicit desired emotions from the targeted audience.

"The GANs' ability to generate hyperreal faces could be the subsequent step for implementing these powerful tools in daily life."

The human factor

A potentially popular use of GAN faces would be the ability of advertising companies to find the most suitable face for promoting their products, without the need to hire real models. In particular, the ability to do so with someone that does not just look normal or real, but hyperreal, is already being put to the test, with CGI fictional characters taking social media by storm [12, 13] and the people / companies behind them making significant profit, or promoting specific agendas [14]. Lil Miquela (@lilmiquela, Instagram 2019, ranked in TIME magazine's 25 most influential people of the internet in 2018 [15]), Shudu (@shudu.gram, Instagram 2019), Blawko (@blawko22, Instagram 2019), and Bermuda (@bermudaisbae, Instagram 2019) are just a few examples to mention. These characters have been experimenting with various types of promotion, and their whole digital existence revolves around a specific detailed narrative about their lives. Not only does the audience have no trouble accepting an admittedly not real person, but they even sympathise with their emotional posts or the troubles of their daily "life". Using Lil Miquela as a case study, the early confusion under her posts about why she looks so perfect, polished, and is seemingly interacting with actual people, is what raised the hype and her popularity. Case in point: the GANs' ability to generate hyperreal faces, combined with the evolution of applications such as few-shot adversarial learning, could be the subsequent step for implementing these incredibly powerful tools in daily life.

The "world's first digital supermodel" [16], Shudu. A CGI model and Instagram influencer, Shudu was created by photographer Cameron-James Wilson (Figure 7).
Blawko, another CGI Instagrammer, who always hides part of his face in his pictures (Figure 8).
References

[1] Solsman, J. E. (2019, May 24). Samsung deepfake AI could fabricate a video of you from a single profile pic. Retrieved from https://www.cnet.com/news/samsung-ai-deepfake-can-fabricate-a-video-of-you-from-a-single-photo-monalisa-cheapfake-dumbfake/
[2] Zakharov, E., Shysheya, A., Burkov, E., & Lempitsky, V. (2019, September 25). Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. ArXiv e-prints. Retrieved from https://arxiv.org/pdf/1905.08233.pdf
[3] Leon, H. (2019, June 12). Why AI Deepfakes should scare the living bejeezus out of you. Retrieved from https://observer.com/2019/06/ai-deepfake-videos-mark-zuckerberg-joe-rogan/
[4] Zakharov, E., Shysheya, A., Burkov, E., & Lempitsky, V. (2019, September 25). Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. ArXiv e-prints. Retrieved from https://arxiv.org/pdf/1905.08233.pdf
[5] Satter, R. (2019, June 13). Experts: Spy used AI-generated face to connect with targets. Retrieved from https://apnews.com/bc2f19097a4c4fffaa00de6770b8a60d
[6] Wainwright, L. S. (2014, July 24). Photo-realism. Retrieved from https://www.britannica.com/art/Photo-realism
[7] Kobe, D. (2019, March 21). Photorealism, Photography, and a Generational Change in Perspective. Retrieved from https://medium.com/@dmkobe94/photorealism-instagram-and-a-generational-change-in-perspective-3d91b4c87641
[8] Lansroth, B. (2015, November 28). Photorealism in Art - A Debated Style. Retrieved from https://www.widewalls.ch/photorealism-art-style/
[9] Amoakoh, S. (n.d.). What is Hyperrealism? Retrieved from https://useum.org/hyperrealism/what-is-hyperrealism
[10] Bredekamp, H., & Stafford, B. M. (2006, January 1). One step beyond Hyperrealism. Tate etc, 5. Retrieved from https://www.tate.org.uk/tate-etc/issue-5-autumn-2005/one-step-beyond
[11] Synced AI Technology And Industry Review (2018, December 14). GAN 2.0: NVIDIA's hyperrealistic face generator. Retrieved from https://syncedreview.com/2018/12/14/gan-2-0-nvidias-hyperrealistic-face-generator/
[12] Condon, O. (2018, December 25). Meet the CGI influencers that are fooling everyone on Instagram. Retrieved from https://www.dailyedge.ie/cgi-influencers-robots-instagram-4407390-Dec2018/
[13] Du Parq, A., & London, B. (2018, September 13). The man behind Shudu Gram & the world's first 'digital supermodels' reveals the secrets behind his stratospheric success. Retrieved from https://www.glamourmagazine.co.uk/article/shudu-gram-virtual-supermodels
[14] Yurieff, K. (2018, June 25). Instagram star isn't what she seems. But brands are buying in. Retrieved from https://money.cnn.com/2018/06/25/technology/lil-miquela-social-media-influencer-cgi/index.html
[15] Time Staff (2018, June 30). The 25 Most Influential People on the Internet. Retrieved from https://time.com/5324130/most-influential-internet/
[16] Alti, A. (2018, December 10). Interview With Cameron-James Wilson, Creator of Shudu, the world's first digital supermodel. Retrieved from https://wersm.com/interview-with-cameron-james-wilson-creator-of-shudu-the-worlds-first-digital-supermodel/
Images
Figure 1 (p. 14-15 full spread background) Youtube (2018), Baauer & Miquela - Hate Me [video screenshot]. Retrieved from https://www.insider.com/cgi-influencers-what-are-theywhere-did-they-come-from-2019-8
Figure 2-4 Zakharov, E., Shysheya, A., Burkov, E., & Lempitsky, V. (2019). [digital illustrations]. Retrieved from https://arxiv.org/pdf/1905.08233.pdf
Figure 5 Estes, R. (2005). TKTS Line [painting]. Retrieved from https://www.escapeintolife.com/painting/richard-estes/
Figure 6 Mueck, R. (2011). Mask II [sculpture]. Retrieved from https://www.theatlantic.com/photo/2013/10/the-hyperrealistic-sculptures-of-ron-mueck/100606/
Figure 7 Wilson, C. J. (2018). Shudu [CGI]. Retrieved from https://www.instagram.com/p/Bf37kZzFZ42/
Figure 8 Machine-A (2019). Blawko [CGI]. Retrieved from https://www.machine-a.com/blogs/journal/machine-a-x-blawko-interview
REGARDING VISION; biological & digital Gesthimani Roumpani
The ability of GANs (Generative Adversarial Networks; for more, see the chapters "Introduction and Theory of GAN" by Ruohan Yang and "This Person Is Just a StyleGAN" by Gesthimani Roumpani) to generate realistic original content is indisputable. Sinister or auspicious scenarios aside, the mere ability of neural networks to produce that content is fascinating. A real discussion made it apparent that not everyone is aware of what a breakthrough that is, which undermines the purpose of the aforementioned websites: to raise awareness about the ability of GANs. This chapter explains and compares various types of vision, in an attempt to illustrate the complexity of computer vision and establish how big a leap neural networks have taken towards visualization and content creation.
Vision of humans
As we begin to examine the way vision works, we will start with the most familiar of systems: human vision. Even though our world is 3-dimensional, each eye perceives the scene it is looking at as 2-dimensional. Because our eyes sit in slightly different locations, each eye records a slightly different image (Figure 2). Once these images are combined to generate the final result, their difference is interpreted as depth, which results in the 3-dimensional translation of the world as we know it (Figure 3). This is called stereoscopic vision, and it works for approximately 18 feet; beyond that, our brain uses relative scale to determine depth [1].
Stereoscopic photograph (Figure 2).
A red and blue stereoscopic image of asteroid Ryugu, prepared from the images taken by the Optical Navigation Camera - Telescopic. The two color channels are combined to produce the final, 3-dimensional result (Figure 3).
The images are all generated through the decoding of light rays that enter the eye and are focused onto the retina, which plays a role equivalent to that of the film in a camera [2]. The retina is populated with two types of photoreceptor cells, called rods and cones: the former are responsible for low-light vision (scotopic), and the latter for color vision and spatial acuity (photopic). According to the Center for Imaging Science, there are three types of cones in our eyes: the short-wavelength sensitive (S-cones), the middle-wavelength sensitive (M-cones), and the long-wavelength sensitive cones (L-cones) [3]. What we call RGB (Red, Green, Blue) channels correspond to those three types. Color blind people are usually missing one cone type, whereas those called “tetrachromats” have an additional cone type that allows them to see more colors where most just see a single hue [4]. In the electromagnetic (EM) spectrum [5], human photoreceptors are operational at the mesopic levels (visible light) [6], a range towards the center of the spectrum (Figure 4). But how does that compare to other living beings, such as insects and other animals?
The EM spectrum (Figure 4). Bands shown in order of increasing wavelength (decreasing energy): gamma rays, x-rays, UV, visible light (400 nm to 700 nm), infrared, radar, and radio waves (TV, FM, AM). Scale marks: 0.0001 nm, 0.01 nm, 10 nm, 1000 nm, 0.01 cm, 1 cm, 1 m, 100 m.
Vision of animals
The number of cone photoreceptor types is one important factor that differentiates animal from human vision. Although not a rule, oftentimes this number is associated with the ability to visualize a larger area within the EM spectrum. In other cases,
Human vision (left) VS bird vision (right) (Figure 5).
Human vision (left) VS bird vision (right) (Figure 6).
more photoreceptors simply allow for a less computation-heavy process, as in the case of the mantis shrimp. This species has been found to have 12 cones, but it does not have particularly good color vision. Each cone has a very narrow range and is essentially dedicated to a very specific color. For the shrimp’s brain, that means there is no need for intensive calculations when identifying the colors of different prey [7].
Another interesting species is the common bluebottle butterfly, found to have 15 types of color photoreceptors [8], including some that respond to ultraviolet light. According to researchers, most of those cones are used for navigation in very specific environments; on a day-to-day basis, the butterflies use just four of their cones.
Birds and bumblebees are also capable of seeing the ultraviolet spectrum. Interestingly enough, the latter have as many cones as humans, but their spectrum is shifted more towards the UV wavelengths (Figures 5 & 6). Boas and pythons, on the other hand, are able to see the infrared spectrum [9], found on the other side of the visible light range.
All the aforementioned examples compare vision in terms of color and take the ability to perceive the 3-dimensional nature of the world as a given. But creatures the size of ants are sometimes too small for the third dimension, and hence have very minimal spatial ability.
...and ants
The eyes of ants are populated with optical units called ommatidia. The fewer the ommatidia, the blurrier the ant’s vision [10] (Figure 7). Generally, ants use chemical trails that form a collective memory for wayfinding [11], which means a new obstacle on a known trail needs some time before it is reflected in the chemical trail. As part of a study [12] examining ant vision, obstacles were placed to interrupt a route known to various species of ants. The bigger ants, which have more ommatidia, could also use their eyes for navigation, whereas the smaller ones only used chemical trails and thus could not avoid the new obstacle as fast. The study essentially showed that the bigger the ant, the better its vision and spatial acuity.
This case of being too small, or “flat”, is commonly described as ants only being able to see two dimensions at a time. Thinking of 2-dimensional vision in terms of ants, though, is directly tied to lower resolution. Computer vision, as expressed in the form of pixels, is also 2-dimensional but capable of a remarkably better resolution. So, how do computers decode image input into a result that makes sense to them? How do they analyze the picture in order to later be able to generate their own original content, mimicking real, tangible forms?
Computers & dimensions
Following the brief overview of different types of vision, all within the electromagnetic spectrum, it is time to examine how computers
Human vision VS ant vision: human eye view, ant's eye view (650 ommatidia), and ant's eye view (150 ommatidia) (Figure 7).
“see” the world, or better, the images we supply them with and those they create in return. Leaving the concept of vision within the EM spectrum and shifting to computer vision, let us first consider how an image size is described in digital terms. This is done in the form of 2 numbers representing the number of pixels, e.g. 1920x1080. Hence the intuitive answer that computer vision is 2-dimensional, which brings up the question of how the third dimension, depth, is taken into consideration.
Starting from grayscale images, we only have one color channel. The grayscale channel gives each pixel a value from 0 to 255 to quantify its brightness [13] (Figure 8). Therefore, if we created a geometry to represent dimensions x, y, and z (where z is the grayscale value), it would look like a 2-dimensional surface living in a 3-dimensional space (Figure 9). It is considered 2d because only x and y are parameters, whereas z is a value dependent on x and y. The best example to illustrate what that means is the surface of the earth: any point on it is defined by 2 coordinates [14], even though heights differ throughout. That is because one can choose how to move along latitude and longitude, but there is no choice for the height; that is a given number we cannot change.
Figure 8
Grayscale image as 2d surface in 3d space (Figure 9).
Color hues as digital coordinates (Figure 10).
Pixel areas, distinct but slightly overlapping (Figure 11).
Similarly, the 2D matrix that a computer sees when looking at an image is a set of x and y values. The grayscale channel can be the z component, which results in a surface with peaks and valleys, much like the surface of the earth. When we replace the grayscale channel with the RGB channels, complexity is added because what was previously one value is now three (Figure 10); but the main concept remains the same.
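The idea of a grayscale image as a surface can be made concrete with a few lines of Python. This is only a minimal sketch: the 4x3 image, its brightness values, and the helper names `height_at` and `as_surface` are made up for illustration and do not come from the sources cited in this chapter.

```python
# A tiny hypothetical grayscale image: rows run along y, columns along x,
# and every entry is a brightness value between 0 (black) and 255 (white).
image = [
    [0,  64, 128, 255],
    [32, 96, 160, 224],
    [16, 80, 144, 208],
]

def height_at(img, x, y):
    """Return z, the brightness stored at pixel (x, y)."""
    return img[y][x]

def as_surface(img):
    """List every pixel as an (x, y, z) point of the surface."""
    return [(x, y, z) for y, row in enumerate(img) for x, z in enumerate(row)]

points = as_surface(image)
height, width = len(image), len(image[0])

# We are free to pick x and y, but z is looked up, never chosen.
print(width, "x", height, "pixels give", len(points), "surface points")
print("z at (x=2, y=1):", height_at(image, 2, 1))
```

Only x and y appear as arguments of `height_at`; z is whatever the image stores at that position, which is why the surface counts as 2-dimensional.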
What neural nets see (Figure 12).
This time, we have five values for each pixel (x, y, R, G, and B). To achieve a visual translation in human terms, we will break up the image into three matrices:
• x, y, and R
• x, y, and G
• x, y, and B
As a result, we end up with three 2d surfaces, each in a 3d environment. When the computer combines the three to get the final image, the result is a 2d surface within a 5-dimensional environment, which is not possible to translate into human visual terms. But as in the grayscale example, the surface is 2-dimensional because only two parameters are free to change; the other three (R, G, and B) are dependent on x and y.
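The split into three matrices, and the five values per pixel, can be sketched in Python as follows. The 2x2 image and the function names `split_channels` and `five_values` are hypothetical, chosen only to illustrate the description above.

```python
# Hypothetical 2x2 color image: each pixel holds an (R, G, B) triple, 0-255.
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (128, 128, 128)],
]

def split_channels(img):
    """Return three 2D matrices (R, G, B), each the same size as the image."""
    return tuple(
        [[pixel[c] for pixel in row] for row in img]
        for c in range(3)
    )

def five_values(img):
    """List each pixel as (x, y, R, G, B): two free parameters, three looked up."""
    return [(x, y, *pixel) for y, row in enumerate(img) for x, pixel in enumerate(row)]

red, green, blue = split_channels(rgb_image)
print("R matrix:", red)
print("G matrix:", green)
print("B matrix:", blue)
print("pixel at x=1, y=0:", five_values(rgb_image)[1])
```

Each of the three matrices is a grayscale-like surface of its own, and stacking them back together recovers the original image.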
Even though this breaks down how computers understand an image, it still does not explain the process of image classification, which is what allows GANs to generate realistic-looking results.
The Architecture of Convolutional Neural Networks
One of the most sophisticated methods for computers to analyze patterns is the Convolutional Neural Network (CNN), which is directly linked to computer vision. As explained above, for each color image a computer sees three matrices. Hence an image that for us has 1000 pixels holds, for a computer, 3 x 1000 = 3000 values. But how does this help detect and classify something in an image as a specific object? This is achieved in multiple layers, applied one on top of the other. As explained by Devanshi Upadhyay (2019) in her article on Medium [15], when analyzing an image, a CNN looks at distinct but slightly overlapping areas, trying to recognize patterns in the arrangement of pixels (Figure 11). CNNs use sets of pattern-searching units (convolution kernels) that have the same number of pixels as the areas examined, so that each kernel value pairs one-to-one with a pixel of the area. The kernel works as a filter: each area of the original image is multiplied by it, value by value, and the products are summed, to produce a new, filtered version of the image (feature map).
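The filtering step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation of any particular library: the function name `convolve2d`, the 4x4 image, and the 2x2 edge-detecting kernel are all hypothetical, and the kernel slides one pixel at a time with no padding.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (step of 1 pixel, no padding).

    Each output value is the sum of the element-wise products between
    the kernel and the image area it currently covers.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical 4x4 grayscale image: dark left half, bright right half.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

# Hypothetical 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)  # each row is [0, 510, 0]: the edge sits in the middle column
```

Neighbouring areas overlap because the kernel moves by a single pixel, and the feature map lights up only where the pattern the kernel encodes is present.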
The next step is to apply a rectification layer (ReLU, or rectified linear unit) to the feature map. It replaces every negative value with zero, which eliminates linearity in the relationship between pixels within the filtered image and achieves a more natural result. Increasing the complexity of the patterns a CNN is searching for also increases the complexity of the computation, together with the number of feature maps and ReLU layers needed. This is where the pooling layer comes in: it shrinks each feature map, restricting the number of patterns the CNN is focused on and keeping only the most relevant information. The final layer is the classification layer, which is used to categorize content within an image according to the patterns recognized. This is the only layer that is fully connected, like classic neural networks are, hence the name “fully connected layer” (FC). The sequence of layers described above is repeated multiple times before the final FC layer, to enable the CNN to classify the image.
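The three remaining layer types can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the feature map values, the 2x2 max pooling (one common pooling choice among several), the weights, and the two class labels are invented, and a real network would learn its weights during training rather than have them written by hand.

```python
def relu(feature_map):
    """Rectification: replace every negative value with zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Split the map into non-overlapping size x size windows and
    keep only the largest value of each window."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

def fully_connected(values, weights, biases):
    """One score per class: every input contributes to every score."""
    return [
        sum(w * v for w, v in zip(class_weights, values)) + b
        for class_weights, b in zip(weights, biases)
    ]

# Hypothetical 4x4 feature map, as a convolution might produce it.
feature_map = [
    [-3,  5,  0, -1],
    [ 2, -8,  4,  1],
    [-2, -1, -6,  7],
    [ 0,  3,  2, -4],
]

rectified = relu(feature_map)
pooled = max_pool(rectified)               # 4x4 shrinks to 2x2
flat = [v for row in pooled for v in row]  # flatten before the FC layer

# Hypothetical hand-written weights for two made-up classes.
weights = [[1, 0, 0, 1], [0, 1, 1, 0]]
biases = [0, 0]
labels = ["class A", "class B"]

scores = fully_connected(flat, weights, biases)
predicted = labels[scores.index(max(scores))]
print(pooled, scores, predicted)
```

Pooling is what keeps the computation manageable: the sixteen values of the feature map are reduced to four before the fully connected layer ever sees them.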
This is a very high-level, non-technical explanation of a CNN architecture, in an attempt to translate into human visual terms how computers “see” images and to break away from the stereotype of how “simple” it is to classify objects within an image. Even though our brains operate with their "biological algorithms" in order to transform the input from the environment into something we can classify and understand, computers, which by default work with algorithms, took years to get even remotely
Faces imagined as seen by computers (Figure 13).
Deep Dream, AI visualization algorithm by Google (Figure 14).
close to how our "biological algorithms" work. CNNs were the answer to taking this leap. GANs use CNNs as their generator and discriminator models in order to produce their content [16]. The generator learns to produce images that follow the patterns of the initial image package it is trained on. The discriminator, on the other hand, compares the generator’s images to the patterns it has detected in the initial image package, and its verdicts are what the generator learns from.
Conclusion
Although neural networks were inspired by the structure and functionality of the human brain and are often described in an anthropomorphized way [17], the two differ in many respects, including but not limited to vision, as examined in this chapter. Computer vision is structured so as to mimic human vision: to recognize and reproduce the visible light spectrum by following a specific yet computation-heavy process. Understanding this distinction is imperative for one to fully grasp how powerful neural networks, and subsequently GANs, are, and what their ability to generate realistic-looking images really means for the evolution of computer science.
References
1 Southern California Earthquake Center (n.d.). How do I see depth? Retrieved from http://scecinfo.usc.edu/geowall/stereohow.html
2 National Keratoconus Foundation (n.d.). How does the human eye work? Retrieved from https://www.nkcf.org/about-keratoconus/howthe-human-eye-works/
3 Center for Imaging Science (n.d.). Rods & cones. Retrieved from https://www.cis.rit.edu/people/faculty/montag/vandplite/pages/chap_9/ch9p1.html
4 Smith, B. (2016, March 11). The incredible and bizarre - spectrum of animal colour vision. Retrieved from https://cosmosmagazine.com/biology/incredible-bizarre-spectrum-animal-colour-vision
5 NASA (2013, March). The electromagnetic spectrum. Retrieved from https://imagine.gsfc.nasa.gov/science/toolbox/emspectrum1.html
6 Center for Imaging Science (n.d.). Rods & cones. Retrieved from https://www.cis.rit.edu/people/faculty/montag/vandplite/pages/chap_9/ch9p1.html
7 Thoen, H. H., How, M. J., Chiou, T. H., & Marshall, J. (2014, January 24). A different form of color vision in mantis shrimp. PubMed e-print. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/24458639
8 Chen, P. J., Awata, H., Matsushita, A., Yang, E. C., & Arikawa, K. (2016, March 8). Extreme Spectral Richness in the Eye of the Common Bluebottle Butterfly, Graphium sarpedon. Frontiers in Ecology and Evolution. https://doi.org/10.3389/fevo.2016.00018
9 Smith, B. (2016, March 11). The incredible and bizarre - spectrum of animal colour vision. Retrieved from https://cosmosmagazine.com/biology/incredible-bizarre-spectrum-animal-colour-vision
10 Palavalli-Nettimi, R. (2018, April 18). The science behind ant vision. Retrieved from https://www.australiangeographic.com.au/topics/wildlife/2018/04/the-science-behind-ant-vision/
11 Heyman, Y., Vilk, Y., & Feinerman, O. (2019, April 5). Ants Use Multiple Spatial Memories and Chemical Pointers to Navigate Their Nest. iScience, 14, 264-276. https://doi.org/10.1016/j.isci.2019.04.003
12 Palavalli-Nettimi, R., & Narendra, A. (2018, April 6). Miniaturisation decreases visual navigational competence in ants. Journal of Experimental Biology. doi: 10.1242/jeb.177238
13 Visalpara, S. (2016, January 15). How do computers see an image? Retrieved from https://savan77.github.io/blog/how-computers-see-image.html
14 GISGeography (2019, March 4). Latitude, Longitude and Coordinate System Grids. Retrieved from https://gisgeography.com/latitude-longitude-coordinates/
15 Upadhyay, D. (2019, February 17). How a Computer Looks at Pictures: Image Classification. Retrieved from https://medium.com/datadriveninvestor/how-a-computer-looks-at-pictures-image-classification-a4992a83f46b
16 Richárd, N. (2018, September 4). The differences between Artificial and Biological Neural Networks. Retrieved from https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7
17 Shibuya, N. (2017, November 2). Understanding Generative Adversarial Networks. Retrieved from https://medium.com/activating-robotic-minds/understanding-generative-adversarial-networks-4dafc963f2ef
Images
Figure 1 (p. 26-27 full spread background) OMIKRON / Science Photo Library (n.d.). False-colour SEM of rods and cones of the retina [electron micrograph (SEM)]. Retrieved from https://www.sciencephoto.com/media/308755/view
Figure 2 Antoine Claudet (n.d.). [Stereoscopic print]. Retrieved from https://photofocus.com/photography/history-of-photography-stereoscopic-photography/
Figure 3 JAXA, University of Aizu, University of Tokyo, Kochi University, Rikkyo University, Nagoya University, Chiba Institute of Technology, Meiji University and AIST (2018). A stereoscopic image of Ryugu at high resolution [photography by the Optical Navigation Camera - Telescopic (ONC-T)]. Retrieved from http://www.hayabusa2.jaxa.jp/en/topics/20180731e/index.html
Figure 4 Illustration inspired by: Electromagnetic spectrum visible light wavelengths [digital illustration]. Retrieved from https://adiklight.co/electromagnetic-spectrum-visible-light-wavelengths/
Figure 5 Sartore, J. (n.d.). [Photography]. Retrieved from https://www.boredpanda.com/human-vs-bird-vision/?utm_source=google&utm_medium=organic&utm_campaign=organic
Figure 6 [Photography]. Retrieved from https://www.boredpanda.com/human-vs-bird-vision/?utm_source=google&utm_medium=organic&utm_campaign=organic
Figure 7 Murray, T. (n.d.). [Digital illustration]. Retrieved from https://theconversation.com/in-an-antsworld-the-smaller-you-are-the-harder-it-is-tosee-obstacles-92837
Figure 8 Retrieved from http://assets.runemadsen.com/classes/programming-design-systems/pixels/index.html
Figure 9 Van Eck, W., & Lamers, M. H. (2015). Grayscale 512 by 512 pixel image showing fungi and bacterial cultures (left), and generated terrain using the image directly as a heightmap (right) [digital illustration]. Retrieved from https://www.semanticscholar.org/paper/Biological-Content-Generation%3A-Evolving-Game-Living-Eck-Lamers/4cf21eb5c06752f598d968b1c3f7570598ec54dd
Figure 10 Retrieved from http://assets.runemadsen.com/classes/programming-design-systems/pixels/index.html
Figure 11 Humboldt State University (n.d.). [Digital illustration]. Retrieved from http://gsp.humboldt.edu/olm_2016/courses/GSP_216_Online/lesson3-1/raster-models.html
Figure 12 Kogan, G. (2017). [Digital illustration by AI]. Retrieved from https://experiments.withgoogle.com/what-neural-nets-see
Figure 13 Ferriss, A. (2015). [Digital portrait]. Retrieved from https://www.fastcompany.com/3047445/hallucinatory-portraits-show-how-computers-see-our-faces
Figure 14 Google Deep Dream (n.d.). [Digital painting generated by AI]. Retrieved from https://guff.com/the-way-your-computer-sees-the-worldwill-terrify-you