I have had a number of people ask me why we should bother creating natural language processing (NLP) tools that preserve privacy. Apparently not everyone spends hours upon hours thinking about data breaches and data privacy infringements. Shocking!

The main argument for privacy-preserving NLP comes down to one fact: text and speech are our primary methods of communication. When interacting with web-based service providers and other users thereof, we often allow companies to store, use, and even sell the messages we have sent and received on their platforms.

So why do we even share this information with third parties in the first place? Frequently, the answer lies in our desire to get “free” services.

After the Cambridge Analytica/Facebook scandal, one can only hope that the general public has been made aware of how the personal data they have unknowingly been giving away in exchange for “free” services can be sold to third parties for any purpose at any time, without their knowledge or consent. Sure, the thought of targeted ad campaigns siloing us into a political echo chamber is scary, but isn’t it even worse to think that the data being used to target us might not even be de-identified when sold to third parties? Heck, even if the data is de-identified, cross-referencing it with other databases for the purpose of re-identification is far from unheard of. Also keep in mind that the more de-identified a dataset is, the less useful it becomes for all sorts of tasks, making it less monetizable. Sure, your name and number can be removed, but how about the locations you’ve visited, your preference for certain restaurants, and even your favourite tea flavours? Innocuous-seeming information becomes part of your digital fingerprint and can give your identity away. We should therefore be incredibly wary of what written and spoken data we allow onto the marketplace.
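To make the re-identification point concrete, here is a minimal sketch of cross-referencing a “de-identified” dataset with a second one. Everything in it is invented for illustration: the records, the field names, and the `reidentify` helper are assumptions, not a description of any real dataset or tool.

```python
# Toy illustration only: every record and field name below is made up.
# Names and phone numbers have been stripped from these logs,
# but location and preferences are still present.
deidentified_logs = [
    {"user": "u1", "city": "Toronto", "restaurant": "Pho Saigon", "tea": "oolong"},
    {"user": "u2", "city": "Toronto", "restaurant": "Burger Hut", "tea": "chai"},
    {"user": "u3", "city": "Ottawa", "restaurant": "Pho Saigon", "tea": "oolong"},
]

# A second, hypothetical dataset (say, public reviews) that still carries names.
public_profiles = [
    {"name": "Alice Example", "city": "Toronto", "restaurant": "Pho Saigon", "tea": "oolong"},
    {"name": "Bob Example", "city": "Ottawa", "restaurant": "Burger Hut", "tea": "chai"},
]

# Attributes that look harmless on their own but narrow people down in combination.
QUASI_IDENTIFIERS = ("city", "restaurant", "tea")


def reidentify(logs, profiles, keys=QUASI_IDENTIFIERS):
    """Link a pseudonym to a name when exactly one profile matches on every key."""
    matches = {}
    for record in logs:
        candidates = [
            profile["name"]
            for profile in profiles
            if all(profile[k] == record[k] for k in keys)
        ]
        if len(candidates) == 1:  # a unique match gives the identity away
            matches[record["user"]] = candidates[0]
    return matches


linked = reidentify(deidentified_logs, public_profiles)
print(linked)
```

In this toy case only `u1` matches a single profile on all three attributes, which is enough to attach a name to that pseudonym. Dropping attributes from `QUASI_IDENTIFIERS` makes unique matches rarer, but it also strips out the very details that made the dataset useful.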

We readily give away our data in exchange for convenience. Biometric authentication is an example of this. Simply put, it means using some unique part of your body/physiology as a password: think of using your fingerprint or face to unlock your computer. In the context of NLP, think of your voice being used by your bank to verify your identity during a customer service call (i.e., speaker authentication). There are two types of speaker authentication: text-dependent and text-independent. In the former, a spoken password is used to identify you in addition to comparing features extracted from your voice to those of previous recordings. In the latter, the system only uses features extracted from your voice while you speak to the automated or human representative, without you having to say any particular keywords.

Securing user data within a speaker authentication system is by far the most researched area related to privacy-preserving NLP, though it certainly is not a solved problem. One common concern is the feature vectors associated with a user’s voice becoming compromised. Such data is therefore encrypted at rest. Another concern is that of replay attacks, where a user’s voice is recorded and played to the authentication system with the purpose of gaining access to particular files or locations. A band-aid solution to this is to use more than one mode of authentication (speech + fingerprint, speech + face recognition, speech + PIN, etc.), such that speech becomes a security enhancer rather than a system’s central component.

Regardless of the system, a user’s privacy must be preserved from malicious outsiders in order for the system to remain reliable. Speech authentication is one of the few examples where service providers have a strong inherent incentive to put a lot of time and money into making sure our spoken data remains private from third parties. They are, however, still storing a biometric identifier associated with your personal information, and that identifier can be used to determine whether you are speaking in other audio or video recordings. Civil liberties groups argue against voice identifiers being stored without explicit consent (such as at Her Majesty’s Revenue and Customs in the UK [2]) and without more information about how they are stored, shared, and erased.
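For readers who want to see the difference between text-dependent and text-independent speaker authentication spelled out, here is a rough sketch. It assumes the voice has already been turned into a fixed-length feature vector and that a speech recognizer has produced a transcript. The three-number vectors, the 0.8 threshold, and the `verify_speaker` helper are all made up for illustration; real systems use far richer features and carefully tuned thresholds.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify_speaker(enrolled, probe, threshold=0.8,
                   expected_phrase=None, spoken_phrase=None):
    """Hypothetical verifier.

    Text-independent mode (no expected_phrase): accept if the voice
    features are close enough to the enrolled ones.
    Text-dependent mode: additionally require the spoken password to match.
    """
    voice_ok = cosine_similarity(enrolled, probe) >= threshold
    if expected_phrase is None:
        return voice_ok
    return voice_ok and spoken_phrase == expected_phrase


# Invented feature vectors standing in for real voice features.
enrolled = [0.9, 0.1, 0.4]         # stored when the user enrolled
same_speaker = [0.85, 0.15, 0.38]  # a later call from the same user
other_speaker = [0.1, 0.9, 0.2]    # someone else

text_independent_ok = verify_speaker(enrolled, same_speaker)
impostor_ok = verify_speaker(enrolled, other_speaker)
wrong_password_ok = verify_speaker(
    enrolled, same_speaker,
    expected_phrase="my voice is my password",
    spoken_phrase="open sesame",
)
print(text_independent_ok, impostor_ok, wrong_password_ok)
```

Note that a replayed recording of the genuine user would produce much the same feature vector and sail through a check like this one, which is why speech on its own makes for a weak central component. The `enrolled` vector is also exactly the kind of stored biometric identifier discussed above.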

What about exchanging data privacy for physical security? A concern that is often raised regarding data being encrypted without any backdoors is whether that makes it easier for nefarious communications (e.g., between terrorists or criminals) to go undetected. And, hey! If you’ve got nothing to hide, you’ve got nothing to fear, right? That wonderfully nonsensical line is reserved for the fraction of us who have the good fortune of living in democratic societies where we won’t get thrown in jail and/or tortured for speaking our minds about the political party in power. Okay, let’s humour the people who think that line flies and that it’s a-okay for (some) governments and police to have quasi-unrestricted access to citizens’ private data via backdoors or requests for information from companies (predominantly without a warrant).

Data breaches. Suppose that for whatever reason there is some government or company that you sincerely trust with storing your written or spoken content. Splendid!
Well, I hate to break it to you, but over 2.6 billion records were breached in 2017 alone (76% due to accidental loss, 23% due to malicious outsiders) [1]. Here’s the crux: you might trust an organization’s intended use of your data without having a clue about how they protect it from being leaked.

Fine, then let’s not even bother sharing our data in the first place. Why not just prevent people from accessing speech and text we produce altogether? Because we want to be provided with free or cheap services (Facebook, Twitter, Instagram, …). We also want training data for AI algorithms that are adaptable to our specific traits and preferences (speech recognition systems, personal assistants, search engines, …).

So how do we get what we want, give companies the data they need to profit/improve services AND maintain our privacy? Research in privacy-preserving NLP is in its infancy, but it is likely to revolutionize the way companies and governments collect, store, process, and sell user data. With regulations like the GDPR coming into effect, public outcry over the Cambridge Analytica scandal, and the massive number of hacks that cost companies millions of dollars in reparations every year, the number of privacy-preserving data processing algorithms will snowball. I will be going into some detail on various (practical as well as still computationally intractable but promising) existing solutions in future blog posts, including privacy-preserving surveillance, federated learning, the application of differential privacy to neural networks in order to prevent reverse-engineering, homomorphically encrypted NLP, and so on.

Acknowledgements. My deepest gratitude to Kelly Langlais, Dr. Siavash Kazemian, and Simon Emond for their invaluable feedback on this post.

References
[1] https://breachlevelindex.com/assets/Breach-Level-Index-Report-2017-Gemalto.pdf
[2] https://go.newsfusion.com//security/item/1234791