Recurrent Convolutional Structures for Audio Spoof and Video Deepfake Detection

Bibliographic Details
Published in:	IEEE Journal of Selected Topics in Signal Processing, 2020-08, Vol. 14 (5), p. 1024-1037
Main Authors:	Chintha, Akash; Thai, Bao; Sohrawardi, Saniat Javid; Bhatt, Kartavya; Hickerson, Andrea; Wright, Matthew; Ptucha, Raymond
Format:	Article
Language:	English
Subjects:	Audio data; Computer architecture; Convolution; Cost function; Datasets; Deception; deep learning; deepfake; Digital media; Entropy; Face; Feature extraction; Forgery; Information integrity; Personal computers; Representations; spoof; Videos

Description
Summary:	Deepfakes, or artificially generated audiovisual renderings, can be used to defame a public figure or influence public opinion. With the recent discovery of generative adversarial networks, an attacker using a normal desktop computer fitted with an off-the-shelf graphics processing unit can make renditions realistic enough to easily fool a human observer. Detecting deepfakes is thus becoming important for reporters, social media platforms, and the general public. In this work, we introduce simple, yet surprisingly efficient digital forensic methods for audio spoof and visual deepfake detection. Our methods combine convolutional latent representations with bidirectional recurrent structures and entropy-based cost functions. The latent representations for both audio and video are carefully chosen to extract semantically rich information from the recordings. By feeding these into a recurrent framework, we can detect both spatial and temporal signatures of deepfake renditions. The entropy-based cost functions work well in isolation as well as in context with traditional cost functions. We demonstrate our methods on the FaceForensics++ and Celeb-DF video datasets and the ASVSpoof 2019 Logical Access audio datasets, achieving new benchmarks in all categories. We also perform extensive studies to demonstrate generalization to new domains and gain further insight into the effectiveness of the new architectures.
ISSN:	1932-4553, 1941-0484
DOI:	10.1109/JSTSP.2020.2999185