Deepfake audio

```mermaid
flowchart LR
	A{{Reference sound extract}} --> B[Voice encoder] -- Footprint of the voice --> E
	C{{Text extract}} --> D
	subgraph Synthesizer
	D[Encoder] --> E[Concat] --> F[Attention] --> G[Decoder]
	G --> F
	end
	G -- Mel-spectrogram --> H[Vocoder] --> I{{Audio signal}}
```
  1. The voice encoder creates a "footprint" of the person's voice (a fixed-size speaker embedding) from a short reference recording.
  2. This footprint is fed to the synthesizer, which translates any text into a mel-spectrogram. The synthesizer models the general mechanics of speech; conditioned on the footprint, the spectrogram also carries the nuances of the person's voice.
  3. The vocoder interprets the mel-spectrogram and converts it into an audio signal that can be listened to (see the sketch after this list).
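
To make the data flow concrete, here is a minimal Python sketch of the three stages. The classes `SpeakerEncoder`, `Synthesizer` and `Vocoder` are hypothetical stubs that return random arrays with plausible shapes; the 256-d embedding, 80 mel channels, and 16 kHz sample rate are assumptions, and only the interfaces mirror the SV2TTS pipeline, not any real implementation.

```python
import numpy as np

class SpeakerEncoder:
    """Stand-in for the voice encoder: reference audio -> speaker embedding."""
    def embed(self, wav: np.ndarray) -> np.ndarray:
        # The real encoder summarizes the reference speech into a fixed-size,
        # L2-normalized vector; a random 256-d unit vector stands in for it.
        e = np.random.randn(256)
        return e / np.linalg.norm(e)

class Synthesizer:
    """Stand-in for the synthesizer: (text, footprint) -> mel-spectrogram."""
    def synthesize(self, text: str, embedding: np.ndarray) -> np.ndarray:
        # The real synthesizer concatenates the footprint with the text
        # encoder output and decodes with attention; here the shapes are
        # plausible but the values are random.
        n_frames = 10 * max(len(text), 1)      # crude stand-in for duration
        return np.random.randn(80, n_frames)   # 80 mel channels (assumed)

class Vocoder:
    """Stand-in for the neural vocoder: mel-spectrogram -> waveform."""
    def infer(self, mel: np.ndarray) -> np.ndarray:
        hop = 200  # samples per mel frame at 16 kHz (assumed)
        return np.random.randn(mel.shape[1] * hop)

# Data flow matching steps 1-3 above:
reference_wav = np.random.randn(3 * 16000)             # 3 s "reference" clip
footprint = SpeakerEncoder().embed(reference_wav)      # step 1: footprint
mel = Synthesizer().synthesize("Any text", footprint)  # step 2: mel-spectrogram
audio = Vocoder().infer(mel)                           # step 3: audio signal
print(footprint.shape, mel.shape, audio.shape)         # (256,) (80, 80) (16000,)
```

The key design choice is that the speaker's identity enters only through the fixed-size footprint, so cloning a new voice requires only a short reference clip, with no retraining of the synthesizer or vocoder.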

Source: Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis (Jia et al., 2018)