Making Sense of Trends in Data Science

With the profusion of data being generated by us and our devices, distributed learning offers a new way of analyzing data while preserving more privacy.

Much of what we do in our daily lives is collected in some database, used to monitor our health, product preferences, geographical locations, political leanings, consumption of energy and food, money expenditures—the list goes on and on. At the same time, the amount of digital data being gathered is growing exponentially.

Organizations and governments increasingly rely on all that data to make decisions in business, health care, public policy, and other critical areas, but extracting useful information from the data trove is a growing challenge as volume increases and the need to analyze it in real time becomes more urgent.

“Data-driven learning is everywhere,” says Usman Khan, associate professor of electrical and computer engineering at Tufts School of Engineering. He was honored in November as lead guest editor for an issue of the Proceedings of the IEEE—one of the leading journals for engineers and scientists—dedicated to the topic of data science, exploring its challenges and optimization.

Tufts has also responded to the emerging trend with the creation of the Data Intensive Studies Center, with faculty and students participating across the university; the Tufts Center for Transdisciplinary Research in Principles of Data Science (T-TRIPODS); and a master’s program in data science. Workshops are being planned to bring data scientists together with others across Tufts using data science algorithms to help solve problems in many different disciplines.

Khan, whose research group recently received an NSF award to study learning and control of a network of quadcopters with the help of optimization theory using in-depth data analytics, spoke recently with Tufts Now about what the future holds for data science.

Tufts Now: Data science is a pretty broad term—what exactly does it entail?

Usman Khan: Data science is indeed a very broad concept. It is used to help in areas like finance, investment and banking, medical diagnostics, robotics, and political campaigns. In finance, it can help with detecting credit card fraud, finding anomalous purchases based on a dataset of billions of purchases and the specific patterns of the individual.

It can be used for early detection of breast cancer from a dataset of thousands of mammograms. In robotics, it can help interpret incoming images gathered by a robot to detect and identify specific objects and adapt how the robot may interact with them. Political campaigns may use large datasets that suggest the political leanings of individuals to target messages and appeals to vote for one candidate or another. And that’s just a sample of what it entails.

Despite its wide range of applications, there are common principles at play. In our research and in the Proceedings of the IEEE, we explored ways in which familiar tools and methods from optimization theory can be applied to modern-day data analysis.

Optimization theory itself is a very traditional field of mathematics that is commonly used in operations research and studied by applied mathematicians. In many schools of engineering, it used to be that industrial engineering departments would mainly do that work.

But more recently, because of the explosion of data and the advances in both software and hardware to collect, store, and process data, many more fields are now involved in developing methods in data science and optimization, including electrical engineering, computer science, and cognitive science.

What are some of the common principles in data science that apply across disciplines?

One way to break down data science is to say it is for learning and control. Learning is the ability to detect a pattern, identify an object, and find the best possible pathways to get from point A to point B, for example. Control is the action that you can take given what you learned.

Some problems do not have a control element. For example, if you want to classify faces of males and females, that’s a learning problem. You can train an algorithm on a set of several thousand images of male faces and several thousand images of female faces.

Given a random picture of a person, data science allows you to determine with a certain confidence whether it is a male face or a female face. This is something that is done on Facebook, which goes even further to learn and identify specific individuals to tag them in photos.

Data-driven learning is everywhere. Amazon Echo and Google Home apply a very advanced form of learning, which is called natural language processing.

Take a quadcopter, or drone. Let’s say it needs to land on a white square on the ground. The first objective is to find the white square—that’s a learning problem.

Now what if the quadcopter needs to land on a vehicle marked with a white square and traveling on the ground? Depending on the terrain, whether this vehicle is moving in a forest or a desert or an urban environment, you can see that the learning problem can become much more complicated.

And what are examples of it being used for control?

The landing process of a quadcopter, for example, is a control problem, because there are many ways to navigate and land, with variables you can change such as speed and direction, and parameters you need to consider such as remaining fuel. Each path to the landing spot may present its own advantages and risks—such as wind, obstacles, or other environmental disturbances.

As the quadcopter moves, another round of learning kicks in to re-evaluate the conditions, the pathways and the advantages and risks. All this is done in real time. Because of the role of risk in control decisions, there may be an element of game theory at play.

The interaction between learning and control comes up in many scenarios, such as robotic surgery or self-driving cars. For self-driving cars, you can envision an entire network of thousands of vehicles coordinating their activity and optimizing traffic—even to the point of eliminating the need for traffic lights.

What’s the next big thing for data science?

An area of significant growth will come from distributed learning. It’s based on the Internet of things (such as cameras that monitor who is coming in and out of your house and refrigerators that have sensors tracking the contents) and what we may call the Internet of mobile things, such as sensors and devices that move around on phones, cars, or even on people, like an Apple Watch or Fitbit.

The new development that has brought this Internet of mobile things to the forefront is 5G communication, which will enable phones and devices to communicate with each other directly rather than through a central server or a node like a cell tower.

This thing-to-thing communication opens up a whole new world in data analysis and optimization. With 5G-enabled device-to-device communication, data analysis can occur without your information ever leaving your device.

If Apple or Google, say, needs to learn faces and objects from your photos, or phrases from your voice samples, those photos and samples are currently sent to a centralized location to train models to recognize those patterns.

But with device-to-device communication, we are able to write algorithms that operate on each device and collaborate with other devices to learn as if they were trained on one large global dataset. Our data is never asked to leave our device, which is a good thing for privacy.

What are some implications for this device-to-device learning?

Imagine that one hospital has 5,000 mammograms and another hospital has 8,000 mammograms. Any technique that is designed for early detection of breast cancer is better off if it can access all 13,000 scans, but health-care data must remain private and protected.

In distributed learning, Hospital A trains on its set of 5,000 scans and shares the result, which in this case would be criteria for finding cancer on a mammogram, but not the original data, with Hospital B. Hospital B then takes that result and refines it on its own set of 8,000 mammograms to achieve a more accurate result.

The results are passed back and forth until the solution no longer changes, that is, until the distributed learning algorithm reaches an equilibrium state. In my research, we have mathematically shown that distributed learning is just as accurate as centralized learning and, in many cases of practical interest, provides a much-needed speedup over the centralized processing times.
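The back-and-forth between the two hospitals can be illustrated with a small sketch. It is a toy under stated assumptions, not Khan's algorithm: the "scans" are synthetic (x, y) points, the shared "result" is a two-parameter straight-line model, each site refines the model with a few gradient steps on its own data (a simple round-robin, or incremental, gradient scheme), and the dataset sizes are scaled down from 5,000 and 8,000 to 50 and 80. All function names, the step size, and the tolerance are hypothetical choices made for illustration.

```python
import random


def make_data(n, rng, intercept=1.0, slope=2.0, noise=0.05):
    """Synthetic stand-in for one hospital's private records (assumption)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = intercept + slope * x + rng.gauss(0.0, noise)
        data.append((x, y))
    return data


def local_refine(model, data, step=0.5, local_steps=3):
    """Refine (intercept, slope) with a few gradient steps on local data only."""
    w0, w1 = model
    n = len(data)
    for _ in range(local_steps):
        # Gradient of half the mean squared error over this site's data.
        g0 = sum((w0 + w1 * x - y) for x, y in data) / n
        g1 = sum((w0 + w1 * x - y) * x for x, y in data) / n
        w0, w1 = w0 - step * g0, w1 - step * g1
    return (w0, w1)


def distributed_fit(site_a, site_b, tol=1e-10, max_rounds=1000):
    """Pass the model back and forth until it stops changing (or rounds run out)."""
    model = (0.0, 0.0)
    for rounds in range(1, max_rounds + 1):
        previous = model
        model = local_refine(model, site_a)  # Hospital A trains, shares model
        model = local_refine(model, site_b)  # Hospital B refines, shares back
        change = max(abs(m - p) for m, p in zip(model, previous))
        if change < tol:
            break
    return model, rounds


def pooled_fit(data):
    """Centralized baseline: ordinary least squares on the pooled data."""
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return (intercept, slope)


rng = random.Random(0)
hospital_a = make_data(50, rng)   # scaled-down stand-in for 5,000 scans
hospital_b = make_data(80, rng)   # scaled-down stand-in for 8,000 scans

model, rounds = distributed_fit(hospital_a, hospital_b)
central = pooled_fit(hospital_a + hospital_b)
print("distributed (intercept, slope):", model, "after", rounds, "rounds")
print("centralized (intercept, slope):", central)
```

Only the two model parameters cross the boundary between sites; the raw points never do. In this toy the round-robin model settles near the pooled fit but not exactly on it, because a fixed step size leaves a small bias. Closing that gap is the kind of question that work on distributed optimization addresses.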

The field of distributed learning also opens the door for competitors to collaborate in ways that benefit their industries as a whole. Competitors can agree to develop learning algorithms on their collective data without having to share or pool their proprietary data sets.

More details on Khan’s research can be found on his website. He can be reached at khan@ece.tufts.edu.

Mike Silver can be reached at mike.silver@tufts.edu.
