In Part 1 of this guide, we covered many of the basics of homomorphic encryption. In Part 2, we will discuss a practical application of homomorphic encryption to privacy-preserving signal processing. For those of you who have no interest in signal processing, please stay tuned for Part 3 of the guide, which should be useful to a more general audience.

The goal of this post is twofold: 1) demonstrating that signal processing in the encrypted domain can actually be quite practical and 2) providing an overview of some of the code written for Professor Gerald Penn and my paper for Interspeech 2019 titled “Extracting Mel-Frequency and Bark-Frequency Cepstral Coefficients from Encrypted Signals.”

Real Discrete Fourier Transform

The Fourier Transform (FT) is one of the most fundamental operations in signal processing. It takes a signal from the time domain to the frequency domain. If you happen to need a primer to FT, one of the clearest explanations that I’ve found so far is in “A Student’s Guide to Waves.” If you can’t get your hands on that book, there exists a copious amount of descriptions online.

Now, for those who are already familiar with FT, you might be wondering how we can possibly convert the following formula to the encrypted domain:


Well, we won’t be using the standard FT formula, since it would mean having to deal with the additional complexity of accounting for imaginary numbers within encrypted calculations. Rather, we will use the Real Discrete Fourier Transform’s (RDFT) formulae, which are very nicely described in [1]:


where N is the number of samples, x(·) is the vector representing the signal, and


To calculate the RDFT, we add the Discrete Cosine Transform (DCT) with the Discrete Sine Transform (DST). For simplicity, we refer to the encrypted version of values ∗ as E(∗).

Luckily for us, cos(2πnk/N) and sin(2πmk/N) can be precomputed and do not need to be encrypted prior to multiplying the values with the encrypted signal. We prepare these values for use with the B/FV homomorphic encryption scheme from the PALISADE Lattice Cryptography Library. Note that we do need to choose a precision at which to round the values by which we will multiply our signal. We choose the number of decimal places to be retained based on a desired level of accuracy φ, then multiply the data by 10^φ (we use ^ to mean “to-the-power-of”) and round to the nearest integer.

Let’s work out the precomputations first. For the DCT, we have to calculate:

We code up the highlighted section as follows:

[gist /]

We perform almost identical calculations for the DST.

We code up the highlighted section as follows:

[gist /]

I hope that these precomputations are fairly self-explanatory.

Next, we need to encrypt the signal. For our experiments, we used an audio recording of a telephone conversation from the Switchboard corpus. The kind of signal is only relevant here insofar as knowing the sample rate, which is 8kHz. In case we ever need to perform division, we create a fraction struct to store encrypted numbers. While this won’t come in handy for the RDFT calculations, it is necessary for calculating Cepstral Coefficients (which is the topic of the paper this code was originally written for).

[gist /]

We encrypt the signal as follows:

[gist /]

Now we finally have the tools necessary to calculate the RDFT:

[gist /]


For a 128-bit security level, it takes 46.5235 s to calculate the RDFTs of 100 frames, given a sample rate of 16, a frame length of 25, and a shift length of 10 on an Intel Core i-7–8650U CPU @ 1.90GHz and a 16GB RAM.

These computations are, of course, parallelizable. If we were to run this code on a more powerful device and were even to adapt it to GPUs, it might become quite practical to perform encrypted FTs on the cloud.

What can we use this for?

Often, healthcare is the first sector that comes to mind when discussing sensitive data collection and processing. Audio recordings of a patient’s voice can actually contain a lot of information that’s useful for diagnosing disease (shout out to WinterLight Labs, Professor Frank Rudzicz, and Dr. Katie Fraser who have done great work on using speech processing to detect cognitive impairments like dementia).

One method for detecting pathological voices is to calculate jitter, which measures whether a gesture is sustained over a short period of time [2].

T_i ( _ used to mean subscript) are the extracted fundamental frequency (F_0) period lengths and N is the number of extracted F_0 periods [3].

Aside from speech, images are another type of signal which can contain ample sensitive information. The Fourier Transform is often used in image processing (e.g., for image filtering).


This work was supported by an RBC Graduate Fellowship and the BRAIN Ontario Research Fund.

Feel free to email me at pthaineATcsDOTutorontoDOTedu if you would like to chat about encrypted signal processing.


[1] O. Ersoy, “Real Discrete Fourier Transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 4, pp. 880–882, 1985.

[2] P. Perrot and G. Chollet, “Helping the Forensic Research Institute of the French Gendarmerie to Identify a Suspect in the Presence of Voice Disguise or Voice Forgery,” in Forensic Speaker Recognition. Springer, 2012, pp. 469–503.

[3] M. Farrus, J. Hernando, and P. Ejarque, “Jitter and Shimmer Measurements for Speaker Recognition,” in Eighth Annual Conference
ofthe International Speech Communication Association, 2007.