Three Years of Crowdsourcing Smart Home Network Traffic

July 25, 2022

Deployed System

Article shepherded by:

Rik Farrow

We [1] developed IoT Inspector, an open-source tool that allows the owners of smart home devices to monitor those devices' network traffic and discover potential security and privacy risks. In this article, we discuss some of what we discovered, as well as the problems we face in collecting reliable and accurate data.

Overview of smart home security and privacy risks

Since the COVID-19 pandemic, new trends have emerged in how people interact with their homes. To many adults, the home is both a place to rest and a place to work. To many children who cannot attend schools, the home is their new learning environment. Also, to individuals with underlying health issues, the home is where they seek care and refuge.

Along with this shift comes the increasing adoption of smart home technologies, also known as the Internet of Things (IoT): for example, smart TVs and speakers for entertainment; smart toys for children to play with; and smart health devices for at-home healthcare monitoring and intervention.

Many of these smart devices are breeding grounds for security and privacy threats. They could have open ports or run vulnerable software. The 2016 Mirai botnet attack is one such example, where home surveillance cameras and DVRs were compromised to launch distributed denial-of-service attacks, as these devices used easy-to-guess passwords. Similarly, numerous news reports abound on home security cameras and baby monitors that were compromised to spy on the inhabitants. In addition to external adversaries, the developers of such devices may also cause harm to the user’s privacy. Some smart TV models, for instance, still send potentially sensitive information to advertisers even after the user has explicitly enabled do-not-track features in the TV settings.

Even more concerning is the fact that these smart devices are often on the same trusted network as other traditional devices: personal phones, tablets, and PCs, along with corporate devices. It is unclear whether and how malware could move laterally from one class of device to another, and whether sensitive personal or corporate information could be leaked. As people spend more time at home, living, playing, studying, working, and healing, they are facing increasing security and privacy risks from within.

Challenges for researchers

Despite these concerns, researchers who study smart home security and privacy have to deal with a major challenge: smart home devices are physical objects, and there is a large variety in terms of the types and manufacturers of devices. It is difficult to automate the setup and analysis of these physical devices at scale, unlike studies of mobile apps. As such, many studies on smart home security and privacy are limited to a small subset of devices in the lab. One way to scale up is to scan the IPv4 space of the Internet, but the results are restricted to Internet-exposed devices, while overlooking devices on private home networks.

This lack of scale makes it difficult to collect real-world data about the vast variety of smart devices on the market—including not only technical data, such as the network traffic, but also human behavior data, such as users’ actual awareness and perception of their own smart devices, considering that smart home security and privacy is both a technical and human problem. Such real-world data would be valuable for understanding the actual security and privacy risks. It would also help researchers develop practical mitigation strategies based on empirical evidence.

In short, whereas computer vision researchers have ImageNet to help them develop models, smart home security and privacy researchers lack a comparable labeled dataset on smart home devices, covering, for example, device traffic and user behaviors.

Crowdsourcing smart home traffic

When we faced this challenge back in 2019, we asked ourselves: “Can we ask real users to run experiments for us, since they are the ones with the large variety of smart home devices?” We thought of paying participants, but we wanted to avoid being limited by our budget. Instead, we wanted users to help us run experiments willingly, because they themselves would benefit.

So we developed an open-source tool for smart home users: IoT Inspector. It visualizes the network activities of smart home devices for users to identify potential security and privacy risks. We developed the user interface and user experience to encourage organic adoption of the tool by users. In return, users donate anonymous data for our research, including aggregated network traffic metadata (such as remote hostnames, IP addresses, and ports) over time, as well as any optional labels provided by the user (such as models and manufacturers of smart devices).

We tried to streamline the user experience, reduce friction, and encourage widespread adoption. Here is an example of how a typical user may interact with IoT Inspector. The user would first download IoT Inspector to their computer. The computer must be running Windows, macOS, or Linux if the user wants to run IoT Inspector’s prepackaged binary; alternatively, the user can simply run IoT Inspector’s source code with Python. The user would run IoT Inspector as the administrator or root while connected to their home network. IoT Inspector would scan the network and discover devices. The user would select devices for inspection. IoT Inspector would then automatically capture network traffic of selected devices, using a technique called ARP spoofing. Throughout this process, the user does not need special cables or dedicated hardware. They do not need to reconfigure their home gateways.
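To make the ARP spoofing step more concrete, the following is a minimal sketch in Python of the kind of ARP reply frame that the technique relies on. It only builds the bytes of the frame and does not send anything on the network. The function names and example addresses are hypothetical and are not taken from IoT Inspector's source code; the frame layout follows the standard Ethernet and ARP formats.

```python
import socket
import struct


def mac_to_bytes(mac: str) -> bytes:
    """Convert a MAC address such as '02:00:00:00:00:01' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", "").replace("-", ""))


def build_arp_reply(sender_mac: str, sender_ip: str,
                    target_mac: str, target_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP reply (hypothetical helper).

    The reply tells the target that `sender_ip` can be reached at
    `sender_mac`. In ARP spoofing, `sender_ip` would be the gateway's IP
    address while `sender_mac` is the inspecting computer's MAC address,
    so the target device sends its traffic through the inspecting computer.
    """
    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP)
    ethernet = (mac_to_bytes(target_mac) + mac_to_bytes(sender_mac)
                + struct.pack("!H", 0x0806))
    # ARP header: hardware type 1 (Ethernet), protocol type 0x0800 (IPv4),
    # hardware address length 6, protocol address length 4, opcode 2 (reply)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    arp += mac_to_bytes(sender_mac) + socket.inet_aton(sender_ip)
    arp += mac_to_bytes(target_mac) + socket.inet_aton(target_ip)
    return ethernet + arp


# Example with made-up addresses on a private network
frame = build_arp_reply("02:00:00:00:00:01", "192.168.1.1",
                        "02:00:00:00:00:02", "192.168.1.50")
print(len(frame), frame.hex())
```

A real tool would send such replies periodically to both the device and the gateway, forward the intercepted packets so that the device keeps working, and restore the original ARP entries when inspection stops. This sketch leaves all of that out.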

Figure 1: Main dashboard of IoT Inspector

Figure 1 shows an example of IoT Inspector's main dashboard, provided by a member of the research team in their own smart home. It shows the network traffic of various smart devices in the past 20 minutes. Users can also view the network activities of individual devices. For example, Ira Flatow, a reporter with National Public Radio, independently used IoT Inspector to analyze the network traffic of their Roku TV. As shown in Figure 2, Flatow shared a screenshot of IoT Inspector in action, which shows that the Roku TV contacted a number of advertising services, including Scorecard Research and Alphabet (DoubleClick).

Figure 2: Ira Flatow, a reporter with National Public Radio, used IoT Inspector to analyze the network traffic of their Roku TV. The screenshot above shows the network traffic that Flatow observed.

We launched IoT Inspector in April 2019. Since then, IoT Inspector has collected the network traffic from more than 63,000 devices. This dataset includes traffic metadata, such as the remote IP addresses, hostnames, and ports, aggregated over 5-second windows. The dataset also includes names and manufacturers of a subset of the devices. Interested readers can view a sample dataset at this link.
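As an illustration of what aggregating metadata over 5-second windows can look like, here is a minimal Python sketch. The packet tuple layout, the function name, and the example values are assumptions made for this illustration; they do not describe IoT Inspector's actual data schema.

```python
from collections import defaultdict

WINDOW_SECONDS = 5  # window size mentioned in the article


def aggregate_traffic(packets):
    """Sum bytes per (window start, device, remote IP, remote port).

    `packets` is an iterable of hypothetical tuples:
    (timestamp_seconds, device_id, remote_ip, remote_port, byte_count).
    """
    totals = defaultdict(int)
    for timestamp, device_id, remote_ip, remote_port, byte_count in packets:
        # Round the timestamp down to the start of its 5-second window
        window_start = int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        totals[(window_start, device_id, remote_ip, remote_port)] += byte_count
    return dict(totals)


# Made-up example: three packets from one device to one remote endpoint
example_packets = [
    (0.5, "device-a", "203.0.113.10", 443, 100),
    (4.9, "device-a", "203.0.113.10", 443, 50),
    (5.0, "device-a", "203.0.113.10", 443, 10),
]
for key, total_bytes in sorted(aggregate_traffic(example_packets).items()):
    print(key, total_bytes)
```

Aggregating in this way keeps only counts per window and endpoint, so no packet payloads need to be stored.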

These 63,000 devices belong to more than 6,400 IoT Inspector users. We did not pay users; they downloaded the tool on their own. These users are anonymous, but a few identified themselves publicly, including reporters from National Public Radio (Ira Flatow) and the Washington Post (Geoffrey A. Fowler), who used IoT Inspector to investigate privacy risks on smart TVs.

To our knowledge, this is the largest open dataset of real-world IoT traffic. A number of ongoing research projects are using this dataset, including a published study on smart TV privacy, along with work-in-progress projects on IoT public key infrastructure, smart home firewalls, IoT software supply chain security, and multi-stakeholder privacy in the smart home. Users can directly access the raw data from IoT Inspector’s user interface. Researchers can request the dataset through this link.

Below, we describe a few challenges in the past three years of operating IoT Inspector and using its dataset for research.

Challenge 1: Identifying and labeling devices

A common question that our users ask is: “What is this device on my network?” When a user runs IoT Inspector for the first time, IoT Inspector shows a list of devices on the network. The user can choose one or multiple devices to "inspect", have the traffic captured, and view the analysis.

We currently use these features to infer device identities, although not all devices can be identified by IoT Inspector:

MAC OUI. The first 3 bytes of a device's MAC address form the Organizationally Unique Identifier (OUI). It shows the company that manufactured the Wi-Fi chip, rather than the device itself. This generally works well for, say, Amazon devices, but not others. For example, we would often see the name “Espressif,” which is a popular manufacturer of IoT boards behind many brands. In such cases, the OUI is not helpful for device identification.
DHCP hostname. A device may announce its hostname as it obtains an IP address via DHCP. For example, some smart door locks announce themselves this way. The problem is that few devices announce their hostnames via DHCP.
HTTP user agent. The user agent string sometimes reveals what the device is. For example, a Samsung TV's HTTP user agent string would include the term “Tizen.” The problem is that the HTTP traffic must be in plaintext for IoT Inspector to see the user agent string; the widespread adoption of HTTPS makes this process difficult. Also, some user agent strings are uninformative, such as when the user agent is simply "curl".
mDNS and UPnP announcements. Again, they are useful in identifying devices—just like DHCP hostnames—but the problem is that not all devices support mDNS and UPnP.
Hostnames contacted by the smart device. They are useful when the hostnames can uniquely identify the device; for example, roku.com is typically contacted by Roku TVs. However, hostnames that belong to popular infrastructure providers, such as AWS, are not useful for device identification.
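The features listed above can be combined into a set of hints about a device's identity. The Python sketch below shows one possible way to do so. The function names and the OUI table are hypothetical; in particular, the OUI entry is a placeholder rather than a real registry assignment, and the sketch is not IoT Inspector's actual inference logic. The “Tizen” and roku.com rules come from the examples in the list above.

```python
# Placeholder table: a real tool would load the public IEEE OUI registry.
PLACEHOLDER_OUI_TABLE = {
    "02:00:00": "Example Chip Vendor",  # made-up entry, not a real assignment
}


def oui_prefix(mac: str) -> str:
    """Return the first 3 bytes of a MAC address as 'AA:BB:CC'."""
    return mac.upper().replace("-", ":")[:8]


def identity_hints(mac, dhcp_hostname=None, user_agent=None, hostnames=()):
    """Collect (source, hint) pairs from whatever features are available."""
    hints = []
    vendor = PLACEHOLDER_OUI_TABLE.get(oui_prefix(mac))
    if vendor:
        hints.append(("oui", vendor))
    if dhcp_hostname:
        hints.append(("dhcp", dhcp_hostname))
    # Only visible when the device sends plaintext HTTP
    if user_agent and "tizen" in user_agent.lower():
        hints.append(("user_agent", "Samsung TV (Tizen)"))
    # Hostnames that point to one brand are useful; shared cloud hosts are not
    if any(h == "roku.com" or h.endswith(".roku.com") for h in hostnames):
        hints.append(("hostname", "Roku"))
    return hints


print(identity_hints(
    "02:00:00:aa:bb:cc",
    user_agent="Mozilla/5.0 (SMART-TV; Linux; Tizen 5.0)",
    hostnames=["api.roku.com", "example.amazonaws.com"],
))
```

An empty result means that none of the features produced a hint, which matches the observation that not all devices can be identified this way.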

In addition to inferring device identities, IoT Inspector also asks users to label their devices with the device name and manufacturer. Effectively, we're crowdsourcing device identities from users, but the user labels can be noisy.

Figure 3: IoT Inspector user interface allows users to label devices using a drop-down textbox, which also supports free text.

There are three problems with users' manual labels: missing labels, inconsistent labels, and wrong labels.

Missing labels. Slightly less than half of our users labeled at least one of their devices, telling us the names and manufacturers of devices. Of all the devices, only 25% have user labels.
Inconsistent labels. Users currently label their devices through a dropdown list (Figure 3). We cannot possibly list every single device name there, so we let users enter free text too. Although the free text gives users the flexibility of labeling devices that we do not already know, free text gives us inconsistent labels. For example, a user could label an Amazon Echo as “Amazon Echo” or “Amazon Alexa.” Both are equivalent, but we would have to train our classifier to know that. This is just one of many examples of different ways to label the same device.
Wrong labels. We have seen a device labeled as a "smart fan" communicating with some Android domains and hundreds of advertising services. When we checked the smart fan’s official website, the fan did not seem to run Android. It is likely that the device was labeled incorrectly.
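For the inconsistent-label problem, one simple mitigation is to map known aliases to a canonical name before the labels are used for training. The Python sketch below illustrates the idea with the “Amazon Echo” and “Amazon Alexa” example from above. The alias table and function name are hypothetical, and a hand-written table like this would only cover the aliases that someone has already noticed.

```python
import re

# Hypothetical alias table: keys are lowercase, whitespace-normalized labels
CANONICAL_ALIASES = {
    "amazon echo": "Amazon Echo",
    "amazon alexa": "Amazon Echo",
}


def normalize_label(raw_label: str) -> str:
    """Map a free-text label to its canonical name when an alias is known."""
    # Collapse repeated whitespace, trim the ends, and ignore letter case
    key = re.sub(r"\s+", " ", raw_label).strip().lower()
    # Unknown labels are returned unchanged (apart from trimming)
    return CANONICAL_ALIASES.get(key, raw_label.strip())


for label in ["Amazon Echo", "  amazon   ALEXA ", "Mystery Device"]:
    print(repr(label), "->", repr(normalize_label(label)))
```

This approach handles inconsistent labels only; it does nothing for missing or wrong labels.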

In short, we can gather a large dataset from real-world users, but we need to improve the label quantity and quality.

Challenge 2: How to communicate risks to users without spooking them

Many of our users ask: “What is my device doing?” Recall that Figure 2 shows the network activities of a Roku TV, where each color corresponds to a third-party service contacted by the TV. We obtain these names from DNS responses or, in some cases, from the Server Name Indication (SNI) field in TLS ClientHello messages. But what if certain DNS packets are missing or cached, or what if ClientHello messages are missing? When that happens, IoT Inspector does not know what company a smart device is talking to.
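To show where the SNI hostname sits inside a ClientHello, here is a minimal Python parser sketch. It assumes that the whole ClientHello arrives in a single TLS record and does not handle messages fragmented across records or TCP segments. The function name is hypothetical, and the code is an illustration of the message layout rather than IoT Inspector's implementation.

```python
import struct


def extract_sni(record: bytes):
    """Return the SNI hostname from a TLS ClientHello record, or None."""
    # TLS record header: content type (0x16 = handshake), version, length
    if len(record) < 9 or record[0] != 0x16:
        return None
    # Handshake header: message type (0x01 = ClientHello), 3-byte length
    if record[5] != 0x01:
        return None
    pos = 9
    pos += 2 + 32  # skip client version and 32-byte random
    try:
        pos += 1 + record[pos]                                # session ID
        pos += 2 + struct.unpack_from("!H", record, pos)[0]   # cipher suites
        pos += 1 + record[pos]                                # compression
        extensions_len = struct.unpack_from("!H", record, pos)[0]
        pos += 2
        end = min(pos + extensions_len, len(record))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:  # extension type 0 is server_name
                # Layout: list length (2), name type (1), name length (2), name
                name_type = record[pos + 2]
                name_len = struct.unpack_from("!H", record, pos + 3)[0]
                if name_type == 0:  # name type 0 is a DNS hostname
                    name = record[pos + 5:pos + 5 + name_len]
                    return name.decode("ascii", "replace")
            pos += ext_len
    except (IndexError, struct.error):
        return None  # truncated or malformed message
    return None
```

If the function returns None, the observer falls back to other sources such as DNS, which is exactly the situation described above when ClientHello messages are missing.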

Even if IoT Inspector knows the remote hostname, the user may not know what the remote hostname means in terms of the device's activity. For example, many Belkin Wemo smart plugs communicate with “api.xbcs.net.” What's xbcs.net? It does not bear the name of the company, Wemo. If you visit xbcs.net, there's no web server. Basically, it is hard to tell what xbcs is, whether it is related to Belkin, or what the device is doing.

Also, the truth could be spooky to users. Here is a real story. A user emailed us and asked: My device is communicating with a military domain; is the military spying on me? The third-party service that their device contacted was “tock.usno.navy.mil”, which is in fact an NTP time server. There are thousands of time servers in the world. It just so happens that the person's device is using this particular time server operated by the Navy.
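One way to make such a finding less alarming is to show the likely purpose of a connection next to the hostname. The Python sketch below annotates a connection based on its protocol and port; for example, NTP uses UDP port 123. The table, function name, and wording of the messages are hypothetical, and a port number is only a hint about the purpose of the traffic, not proof.

```python
# Hypothetical table of well-known (protocol, port) pairs and plain descriptions
WELL_KNOWN_SERVICES = {
    ("udp", 123): "NTP time synchronization",
    ("udp", 53): "DNS lookup",
    ("tcp", 53): "DNS lookup",
    ("tcp", 443): "encrypted web traffic (HTTPS)",
    ("tcp", 80): "unencrypted web traffic (HTTP)",
}


def describe_flow(hostname: str, protocol: str, port: int) -> str:
    """Return a short, non-technical description of a connection."""
    purpose = WELL_KNOWN_SERVICES.get((protocol.lower(), port))
    if purpose is None:
        return f"{hostname}: purpose unknown ({protocol.upper()} port {port})"
    return f"{hostname}: likely {purpose}"


print(describe_flow("tock.usno.navy.mil", "UDP", 123))
```

With an annotation like this, a user who sees a military domain would also see that the device is most likely just checking the time.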

In general, it is tricky to communicate device activities to users. Information could be missing. If it is not missing, we need to be careful not to spook users if they do not have a strong technical background.

Challenge 3: Incentivizing users

How do we convince more users to use, and keep using, IoT Inspector? This is a crowdsourced study. We need more users and more user engagement. Since we do not pay them, we need to build a better product for them.

Currently, we do not have a large number of active users, and many of our users have not labeled their devices. As of June 2022, our users have collectively discovered more than 200,000 devices through network scans. Users inspected about a third of these devices, meaning that they had IoT Inspector capture and analyze the device traffic. These inspected devices belong to the 6,400+ users, but the median duration of running IoT Inspector is only about 40 minutes. Only about a quarter of the devices were labeled, and these labels came from more than 2,900 users.

So there is room for improvement—for instance, longer duration beyond 40 minutes. Can we encourage users to run IoT Inspector for days? Can we get more users to inspect more devices and label more devices?

Since we are not paying the users, we can potentially attract more users and keep them engaged longer by building a better product—one with a better user interface and user experience, for example, by offering more usable information about what their devices are doing behind the scenes. We are currently partnering with Consumer Reports to polish the user experience with a team of professional UI/UX developers. Furthermore, we can also increase the duration of use by deploying IoT Inspector on Raspberry Pi.

To find out more about what our users want, we conducted a number of focus groups with users. One common theme is that users want not just more information about their smart home security, but also ways to take action. Can we let users block devices and certain connections? For example, a user could have an indoor camera. Can IoT Inspector block the camera when the user is at home but unblock the camera when the user is out? These are some of the work-in-progress features for the upcoming IoT Inspector version.

Summary

The past three years have been eye-opening for us. We gathered a large traffic dataset of devices from a large number of users around the world. We faced a lot of difficulties as both operators of IoT Inspector and researchers using the dataset. As we work on the next major release, we welcome readers to experiment with the code, share feedback, and even collaborate with us.

Just as ImageNet is a useful tool for computer vision researchers, we want to make IoT Inspector a useful platform for smart home researchers—not only in security and privacy, but also in other fields such as networking and machine learning—for years to come.

Appendix

References:

[1] IoT Inspector: Crowdsourcing Labeled Network Traffic from Smart Home Devices at Scale. Danny Yuxing Huang, Noah Apthorpe, Gunes Acar, Frank Li, Nick Feamster. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT / Ubicomp). 2020.

Article Categories:

Security

IoT

Last updated September 6, 2023

Authors:

Danny Yuxing Huang is an Assistant Professor at New York University. He is broadly interested in the security, privacy, and usability of consumer-facing technologies.

dhuang@nyu.edu