
07 AUG 2020
The Future of AI: Federated Learning
Introduction
Standard machine learning approaches require centralizing the training data on one machine or in a datacenter or cloud. This centralized training approach, however, is privacy-intrusive, especially for mobile phone users, because mobile phones contain their owners' privacy-sensitive data. To train or benefit from a better machine learning model under such a centralized approach, mobile phone users have to trade their privacy by sending the personal data stored on their phones to clouds owned by AI companies.
To preserve privacy, Google introduced Federated Learning (FL) in 2017: "a specific category of distributed machine learning approaches which trains machine learning models using decentralized data residing on end devices such as mobile phones."
Compared to the centralized training approach, federated learning is a decentralized training approach: it enables mobile phones in different geographical locations to collaboratively learn a shared machine learning model while keeping all the personal data that may contain private information on the device. Users can thus benefit from a well-trained machine learning model without sending their privacy-sensitive personal data to the cloud.
REAL-WORLD USE CASES
It works like this: your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.
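The averaging step can be made concrete with a small simulation. The sketch below is a minimal federated-averaging (FedAvg) loop in NumPy; the tiny linear model, the three simulated clients, and all hyperparameters are illustrative assumptions, not Google's production pipeline.

```python
# Minimal federated-averaging (FedAvg) sketch. Illustrative only: the model,
# client data, and hyperparameters are made up for this example.
import numpy as np

def local_update(global_weights, X, y, lr=0.1, epochs=5):
    """Train a tiny linear model on one device's data and return its weights."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
        w -= lr * grad
    return w

def federated_round(global_weights, clients):
    """One round: each client trains locally, the server averages the updates."""
    updates, sizes = [], []
    for X, y in clients:                     # in practice this runs on-device
        updates.append(local_update(global_weights, X, y))
        sizes.append(len(y))
    # Weight each client's model by its amount of data (the FedAvg rule).
    return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))

# Three simulated "phones", each with its own private data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (20, 50, 30):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=n)))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
print("learned weights:", w)   # should approach [2.0, -1.0]
```

In a real deployment the raw data never leaves each device; only the weight updates (ideally encrypted and securely aggregated) reach the server.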
Google is currently testing federated learning in Gboard on Android, the Google Keyboard. When Gboard shows a suggested query, your phone locally stores information about the current context and whether you clicked the suggestion. Federated learning processes that history on-device to suggest improvements to the next iteration of Gboard's query-suggestion model.

DIFFERENTIAL PRIVACY
Differential privacy is a rigorous mathematical definition of privacy. Its goal is to ensure that different kinds of statistical analysis do not compromise privacy. Privacy is preserved if, after the analysis, the analyzer does not learn anything about the individuals in the dataset; the people in the dataset remain "unobserved".
“Differential privacy makes it possible for tech companies to collect and share aggregate information about user habits, while maintaining the privacy of individual users”.
Example: suppose you, a smoker, decide to be included in a survey, and analysis of the survey data then reveals that smoking causes cancer. Will you, as a smoker, be harmed by the analysis? Perhaps: knowing that you are a smoker, someone may now guess at your health status. The analyst certainly knows more about you after the study than before (which is why this counts as "general information", not "public information"), but was your information leaked? Differential privacy takes the view that it was not, with the rationale that the impact on the smoker is the same whether or not he was in the study. It is the conclusions reached in the study that affect the smoker, not his presence or absence in the data set.
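Beyond the inference question, there is also the question of how a survey can collect a sensitive answer such as smoking status without the curator ever learning any individual's true answer. Randomized response, the classic coin-flip mechanism that differential privacy generalizes, does exactly that. The sketch below simulates such a survey; the coin probabilities, function names, and the 30% smoking rate are illustrative assumptions.

```python
# Randomized response: each respondent flips a coin before answering, so any
# single answer is deniable, yet the aggregate rate can still be estimated.
import random

def randomized_response(truth: bool) -> bool:
    """Answer truthfully on heads; otherwise answer with a second coin flip."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_rate(answers):
    """Observed rate = 0.5 * true + 0.25, so true = 2 * observed - 0.5."""
    observed = sum(answers) / len(answers)
    return 2 * observed - 0.5

random.seed(0)
truths = [random.random() < 0.3 for _ in range(10_000)]   # 30% actually smoke
answers = [randomized_response(t) for t in truths]
print(f"estimated smoking rate: {estimate_rate(answers):.3f}")  # close to 0.30
```

No individual answer reveals whether that person smokes, yet the population-level statistic survives the noise.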
DALENIUS’S AD OMNIA GUARANTEE (1977): [Impossible to enforce]
Anything that can be learned about a participant from the statistical database can be learned without access to the database; in other words, the database should reveal only information that is effectively public. Information about an individual that has already been made public elsewhere is not considered harmful to that individual.
CYNTHIA DWORK’S DEFINITION OF DIFFERENTIAL PRIVACY:
A promise made by a data holder, or curator, to a data subject:
"You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, datasets, or information sources are available."
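This promise has a standard formal counterpart, usually stated as ε-differential privacy. The block below is the textbook formulation rather than anything specific to this article.

```latex
% Standard epsilon-differential privacy: a randomized mechanism M satisfies
% epsilon-DP if, for every pair of datasets D and D' differing in a single
% individual's record, and for every set S of possible outputs,
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S].
\]
% A smaller epsilon means the output distribution barely changes when any one
% person's data is added or removed, which is exactly the promise above.
```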
DATA ANONYMIZATION:
Anonymization is a data processing technique that removes or modifies personally identifiable information; it results in anonymized data that cannot be associated with any one individual.
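As a rough sketch of what such processing often looks like in practice, the snippet below drops a direct identifier and pseudonymizes user IDs. The column names and hashing scheme are illustrative assumptions, not any specific company's pipeline.

```python
# A naive "anonymization" pass: drop direct identifiers, pseudonymize user IDs.
import hashlib
import pandas as pd

ratings = pd.DataFrame({
    "user_id": ["alice", "bob", "carol"],
    "email":   ["a@x.com", "b@x.com", "c@x.com"],
    "movie":   ["Heat", "Alien", "Heat"],
    "rating":  [5, 4, 3],
})

def pseudonymize(uid: str) -> str:
    """Replace a user ID with a one-way hash (a pseudonym, not true anonymity)."""
    return hashlib.sha256(uid.encode()).hexdigest()[:10]

anonymized = ratings.drop(columns=["email"]).assign(
    user_id=ratings["user_id"].map(pseudonymize)
)
print(anonymized)
```

Note that the remaining columns still form a behavioral fingerprint, which is why this kind of processing can fail, as the next example shows.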
Why isn't data anonymization strong enough?
Example: the Netflix movie recommendation system. Netflix released an "anonymized" ratings dataset for its recommendation competition, yet researchers were able to re-identify individual subscribers by cross-referencing the rating patterns with public IMDb reviews.
Hands-on session Jupyter Notebook links:
The Basic Tools of Private Deep Learning
A Toy Federated Learning Example
Federated Learning on MNIST using a CNN
CONCLUSION:
Federated learning is revolutionizing how machine learning models are trained. Google has just released its first production-level federated learning platform, which will spawn many federated-learning-based applications such as on-device item ranking, next-word prediction, and content suggestion. In the future, machine learning models will be trained without relying on compute resources owned by giant AI companies, and users will no longer need to trade their privacy for better services.