An approach called federated learning trains machine learning models on devices like smartphones and laptops, rather than requiring private data to be transmitted to central servers.
The largest benchmark dataset to date for a machine learning technique designed with privacy in mind is now available as open source.
“By training on-premises on data where it’s generated, we can train on larger, real-world data,” says Fan Lai, a graduate student in computer science and engineering at the University of Michigan, who presents the FedScale training environment at the International Conference on Machine Learning this week. A paper about the work is available on arXiv.
“This also allows us to mitigate privacy risks and high communication and storage costs associated with collecting the raw data from end-user devices in the cloud,” says Lai.
Federated learning is still a new technology, and it relies on an algorithm that acts as a central coordinator. The coordinator delivers the model to the devices, where it is trained locally on the relevant user data, then collects each partially trained model and combines them into a final global model.
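The coordinator loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FedScale's code: the article does not say how the partially trained models are combined, so the sketch assumes a sample-weighted average (the approach commonly called federated averaging), a toy linear model, and synthetic data. All function names here are hypothetical.

```python
import random

def local_train(weights, data, lr=0.1, epochs=5):
    """Train y ~ w*x + b on one device's data; the raw data never leaves this function."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return (w, b)

def aggregate(updates):
    """Combine client models, weighting each by its local sample count (assumed rule)."""
    total = sum(n for _, n in updates)
    w = sum(m[0] * n for m, n in updates) / total
    b = sum(m[1] * n for m, n in updates) / total
    return (w, b)

def federated_round(global_weights, client_datasets):
    """One round: send the model out, train locally, collect and merge the results."""
    updates = [(local_train(global_weights, d), len(d)) for d in client_datasets]
    return aggregate(updates)

def make_client(rng, n):
    """Synthetic device data drawn from y = 2x + 1 plus a little noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.05)))
    return data

rng = random.Random(0)
clients = [make_client(rng, rng.randint(20, 60)) for _ in range(10)]

model = (0.0, 0.0)
for _ in range(200):
    model = federated_round(model, clients)

print("learned slope and intercept:", model)
```

Only model parameters cross the device boundary in this sketch; each device's data stays inside `local_train`, which is the property that gives federated learning its privacy appeal.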
For a number of applications, this workflow provides additional privacy and security safeguards. Messaging apps, health data, personal documents and other sensitive but useful training materials can improve models without fear of data center vulnerabilities.
In addition to protecting privacy, federated learning could make model training more resource-efficient by reducing and sometimes eliminating large data transfers, but it faces several challenges before it can be widely deployed. Training on multiple devices means there are no guarantees of available computing resources, and uncertainties such as user connection speeds and device specifications result in a pool of data options of varying quality.
“Federated learning is a rapidly growing area of research,” says Mosharaf Chowdhury, associate professor of computer science and engineering. “But most of the work uses a handful of data sets, which are very small and don’t represent many aspects of federated learning.”
And this is where FedScale comes in. The platform can simulate the behavior of millions of user devices on a few GPUs and CPUs, allowing machine learning model developers to explore how their federated learning program performs without the need for large-scale deployment. It supports a variety of popular learning tasks, including image classification, object recognition, language modeling, speech recognition, and machine translation.
“Anything that uses machine learning on end-user data could be federated,” says Chowdhury. “Applications should be able to learn and improve how they provide their services without actually recording everything their users are doing.”
The authors specify several conditions that must be considered to realistically mimic the federated learning experience: heterogeneity of data, heterogeneity of devices, heterogeneous connectivity, and availability constraints, all with the ability to operate at multiple levels on a variety of machine learning tasks. According to Chowdhury, FedScale’s datasets are the largest yet published and are specifically designed to address these federated learning challenges.
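To make these four conditions concrete, here is a toy simulation of one training round over a population of virtual devices. It is a rough sketch, not FedScale's API or methodology: the `SimClient` fields, the value ranges, the 10 MB model size, and the deadline rule are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class SimClient:
    samples: int           # data heterogeneity: size of the local dataset
    compute_rate: float    # device heterogeneity: samples processed per second
    bandwidth_mbps: float  # connectivity heterogeneity: link speed in megabits/s
    p_online: float        # availability: chance of being reachable in a round

def make_population(n, seed=0):
    """Draw n virtual devices with made-up, widely varying characteristics."""
    rng = random.Random(seed)
    return [
        SimClient(
            samples=rng.randint(10, 500),
            compute_rate=rng.uniform(20, 400),
            bandwidth_mbps=rng.uniform(1, 50),
            p_online=rng.uniform(0.2, 0.9),
        )
        for _ in range(n)
    ]

def round_time(client, model_mb=10.0):
    """Seconds to download the model, train one local pass, and upload the result."""
    compute = client.samples / client.compute_rate
    transfer = 2 * model_mb * 8 / client.bandwidth_mbps  # download + upload
    return compute + transfer

def simulate_round(population, k, deadline_s, rng):
    """Pick up to k online devices and keep those that finish before the deadline."""
    online = [c for c in population if rng.random() < c.p_online]
    selected = rng.sample(online, min(k, len(online)))
    finished = [c for c in selected if round_time(c) <= deadline_s]
    return selected, finished

population = make_population(10_000)
rng = random.Random(1)
selected, finished = simulate_round(population, k=100, deadline_s=30.0, rng=rng)
print(f"{len(finished)} of {len(selected)} selected devices met the deadline")
```

Even in this simplified form, slow links and small devices drop out of a round, which is why results obtained on small, uniform datasets may not carry over to real deployments.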
“We have collected dozens of datasets over the last few years. The raw data is mostly publicly available, but difficult to use because it comes in different sources and formats,” says Lai. “We are continuously working to support large-scale deployment on the device as well.”
The FedScale team has also launched a leaderboard to promote the most successful federated learning solutions trained on the university’s system.
The National Science Foundation and Cisco supported the work.
Source: Zachary Champion for University of Michigan