I am a leading government researcher investigating defensive deception as a tool to protect networks and to detect and impede intruders. In this article, I summarize the findings from a series of experiments on the use of decoys and their impact on cyber attackers. In short, deception works against experienced red teamers: it slows their forward progress, confuses them, and sets off alerts that reveal their presence. These techniques work by exploiting the innate cognitive limitations of the humans behind cyber attacks. The human element is a critical, though often overlooked, component of cybersecurity.
The utility of “realistic enough” honeypots [1] in cyber deception applications has been extensively demonstrated in the literature and in practical scenarios. Honeypots are an established method for luring an attacker into a “safe”, high-fidelity, high-interaction environment, which not only distracts the attacker from the real assets on the network but also allows defenders to collect and study attacker behavior, including tactics, techniques, and procedures (TTPs). However, neither the “realistic enough” threshold nor the “effort versus reward” equation that determines when a honeypot is worthwhile for a specific network has been sufficiently studied. Extensive effort is required to ensure a honeypot is “safe”, “realistic enough”, high-fidelity, high-interaction, interesting, and current. This effort puts the deployment cost out of reach for many networks, and not all companies and organizations have the staffing or desire to collect and study attacker TTPs.
Many other cyber deception techniques are less well known and less well studied, yet have the potential to be a better fit for most networks [2]. For example, decoys and honeytokens can serve as an early warning system for attacker activity on a network. They use deception in the form of low-cost fake assets (decoys) or fake data (honeytokens). When attackers interact with the assets or use the data, a high-confidence alert is triggered, indicating the presence, and location, of the attacker in the network. These often-underutilized defenses can be deployed easily on any network to help identify malicious activity (see [14] for related work). While companies and technologists have advocated the use of cyber deception based on theoretical advantages, in this article I review a series of controlled experiments, executed in conjunction with multiple collaborators, which begin to provide an evidence-based foundation for the effectiveness of cyber deception in cyber defense. I highlight results from the experiments and discuss challenges and lessons learned that may help the design of future experiments.
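To make the alerting mechanism concrete, below is a minimal sketch of a network decoy, assuming a Python environment and hypothetical port choices; it is illustrative only and is not the decoy system used in the experiments. Because no legitimate service listens on a decoy’s ports, any connection attempt is by construction unsolicited and can be reported as a high-confidence alert.

```python
# Minimal TCP decoy sketch (illustrative only; not the decoy system used in
# the experiments). Every connection to a decoy is, by construction,
# unsolicited, so each one yields a high-confidence alert naming the source.
import datetime
import socket
import threading

# Hypothetical ports; production decoys would mimic real service ports
# (e.g., 22, 445), which require elevated privileges to bind.
DECOY_PORTS = [2222, 4445, 8080]

def decoy_listener(port: int) -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(5)
    while True:
        conn, (src_ip, src_port) = srv.accept()
        # No legitimate service lives here, so any touch is suspect.
        print(f"[ALERT] {datetime.datetime.now().isoformat()} "
              f"decoy port {port} touched by {src_ip}:{src_port}")
        conn.close()

if __name__ == "__main__":
    for p in DECOY_PORTS:
        threading.Thread(target=decoy_listener, args=(p,), daemon=True).start()
    threading.Event().wait()  # keep the main thread alive
```

A real deployment would forward these alerts to a SIEM rather than print them, but the core property shown here is what makes decoys cheap: there is no signature tuning, because any interaction at all is the signal.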
The Initial Case Study
The initial case study [3] followed an established three-person red team on a live operational research network through a series of four separate events spaced roughly six months apart. They worked together and were asked to think out loud to explain their actions, thoughts, and strategies, all of which were recorded. Their task was to “recon the network, define its topology, and locate and exfiltrate content of interest”.
In Phase 1, decoys were present on the network, but the participants were unaware of them. The goal was to evaluate the realism of the low-fidelity, low-interaction decoys when presented to unsuspecting experts. With no knowledge of decoys, the red team focused their attacks on the most vulnerable-looking assets on the network (Windows XP, at the time). While the participants were troubled by the high number of failed exploit attempts, it became apparent that the experts were more likely to question their own skills (“I didn’t type that right”) or their hacker tools (“This exploit isn’t reliable”) than the authenticity of the network assets themselves. The majority of exploits (83%) in Phase 1 were launched against the decoys, even though the decoys made up only 19% of the assets on the subnet. The low-cost decoys proved effective against the red team.
In Phase 2, the experts were made aware of the presence of decoys before interacting with the network. While this knowledge did change their behavior, it didn’t help them identify or avoid the decoys. Instead, the knowledge of decoys seemed to increase paranoia and caution: the red team avoided the Windows XP machines and focused on Windows 7, even though both operating systems were represented by both real and decoy assets. This change in attacker behavior gave our research team a new question: is the mere idea of deception enough to influence attacker behavior? Could this help protect real, vulnerable, or out-of-date assets that can’t be removed or patched? Phase 3 was included to address these questions.
In Phase 3, red teamers were informed that there were decoys on the network when, in reality, there were none. Participants spent a great deal of time trying to identify the decoys: they looked for assets with unusual appearances or behaviors and assumed those were decoys. Participants rated some assets as decoys with very high confidence and, importantly, never determined that no decoys were present in the network.
In Phase 4, additional information was provided to the red team with the goal of determining whether decoys are still effective when an attacker knows both that they are present and the technical details of how they work. Participants were provided the user guide and marketing material for the decoy system in advance. The experts stated that the decoys were still harder to identify than they expected. However, they were able to avoid being detected on the network because they now knew what triggered an alert. To avoid detection, they had to revert to slower, older attack tactics and refrain from sending any packets on the network. The results from this phase provided useful information on deployment strategies for those considering decoys in their network. Slowing and reducing attacker behavior in this manner can be viewed as a win for defenders: while it may not be possible to keep all attackers out, it is advantageous to slow them down while the defenders work to locate and mitigate them. The results from each phase of this initial case study were promising, and we believed that network-based deceptions such as decoys are a worthwhile, low-cost defense. However, more data was needed. There were several limitations of this initial work that we sought to correct in the next experiment: 1) no control condition; 2) a low number of participants; and 3) potential learning effects from participants taking part in every condition.
The Tularosa Study
When the follow-on experiment, the Tularosa Study [4], was designed there were several trade-offs that needed to be made in order to increase the power of the results with a larger number of participants:
| Initial Case Study | Tularosa Study |
|---|---|
| Worked as a Team | Worked as Individuals |
| Operational network | Test network |
| Same participants across all conditions | Different participants for each condition |
| Think-aloud protocol; 1-on-1 observations | Real-time chat logs; post-exercise surveys and reports |
| Used own laptop and tools | Provided standardized laptop and tools |
| 3-person team of experts | 130+ expert participants |
| One-day campaigns | One-day campaigns |
While the ideal experiment, in our opinion, would combine features from both the case study and the Tularosa Study, scientific rigor necessitated deviation from that ideal. When the number of participants was greatly increased for the Tularosa Study, design decisions were made to support experimental validity. For example, internal validity was best served by providing standard tools and identical networks, ensuring performance differences were attributable to the manipulation of deception and information. If participants had been allowed to use their own tools, which could improve ecological validity, it would have been hard to determine whether a difference in performance was due to the unique tools used or to the experimental condition. Similarly, while using an operational network is best for external validity, in order to maintain internal validity and the ability to compare results, each participant needed to be presented with the same network state and attack vectors on a test network.
The conditions for Day 1 of the Tularosa Study were: decoys-absent/deception-uninformed (AU); decoys-absent/deception-informed (AI); decoys-present/deception-uninformed (PU); decoys-present/deception-informed (PI).
Design Decisions
The experimental design decisions for the Tularosa Study were made in collaboration with human behavior and cyber security experts [10]. The design did not include specific flags to capture, an additional realism-versus-repeatability trade-off. Capture-the-flag (CTF) procedures are a common scoring criterion for cyber events. However, we decided that allowing participants the freedom to decide what they deemed reportable would best reveal what they perceived as important. As discussed in depth in [8], this design decision was made to remove the potentially faulty extrinsic motivation caused by flags, which can skew the very human behavior we wished to study, even though collecting flags can increase the speed and ease of judging success. In retrospect, the penetration task was left too open-ended, which made scoring attacker success more difficult, increased the resources needed for analysis, and delayed the publication of results. Some middle ground between a typical CTF and the design used for Tularosa would be preferred for future work. Ultimately, aspects of a cyber dataset become less useful as it ages: if analysis takes too long, the utility of the information decays, since the same technologies and strategies may no longer be in use in the wild.
When designing a large-scale experiment involving human subjects, there are numerous considerations, including: 1) internal validity, 2) external validity, 3) practicality/logistics, 4) cost, 5) time, and 6) potential impact of results. Support of experimental validity is the most commonly discussed in the literature, but all are critical to consider. For example, logistical challenges required that most of the sessions be completed by the end of the year (2017): securing expert participants for two days is a great expense, and the funding used to pay participants was set to expire at the end of the calendar year. Pilots were completed to test the instructions and the cyber range; as in all studies, small issues were still discovered once the experiment began, or even after the study was complete. Lessons learned and recommendations for future studies are summarized near the end of this article.
Another consideration was the placement of decoy assets: should decoys replace real assets or simply be added to the network? In option 1, the decoy-absent conditions would have 50 real assets, and the decoy-present conditions would have 25 real and 25 decoy assets. This option was attractive because the total number of machines in the network remains constant across all conditions. However, which real assets should be removed? And how would one directly compare decoy-absent participants’ interactions with the machines that were removed against decoy-present participants’ interactions with the decoys? In option 2, the decoy-absent conditions would have 50 real assets and the decoy-present conditions would have 50 real and 50 decoy assets. This allowed direct comparison of interactions with the real assets across all conditions, but made the task more difficult and time-consuming for participants in the decoy-present conditions by presenting them with more targets. The research team selected option 2 because it is the scenario most similar to how decoys would be used in the wild. A caveat was added to the statistical results noting that the time delays and difficulty of the decoy-present conditions were affected by the additional number of targets, but that is the nature of decoy deployments and is part of their effectiveness.
In the initial case study, when participants were informed, they were specifically told that there were decoys on the network and that they had been fooled by them in Phase 1. They didn’t know the technical details until Phase 4, but they had some experience to draw from, which may have increased their paranoia and caution. In the Tularosa Study, by contrast, informed participants were told only, “There may be deception on the network.” This was done in an attempt to closely mimic reality, where attackers may not know whether deception is being used on a specific network, and it was hypothesized that the uncertainty could increase the impact. In retrospect, this vagueness allowed too many different interpretations of deception; in future studies we recommend providing specific definitions for key terms, including cyber deception, decoys, blue team, etc.
Initial analysis performed on a subset of the Tularosa Study data has provided statistically significant results indicating that decoys are effective at impeding attacker forward progress [5]. The analysis also indicated that the combination of the presence of decoys and attacker knowledge of their use had the largest impact. This was a radical finding, since many believed that deception techniques must be kept strictly hidden to be effective. There were also indications that the deception affected the cognitive and emotional state of the participants. We continue to study these effects, and how they can improve cyber defense, through our Cyberpsychology and Oppositional Human Factors research, in which we seek to identify, induce, and exacerbate known human limitations and deficiencies in human performance in cyber [6,7].
Results
In this article, high-level details of the data collected and research results from the Tularosa Study are provided in order to motivate network owners to use decoys and other cyber deception techniques to bolster their defenses. For detailed statistical analysis, please see [5, 9].
Highlights of previous findings include that when decoys were present (see the analysis sketch following this list):
- Every participant triggered a decoy alert prior to exploiting a real asset on the network.
- 52% of commands targeted decoys (based on commands containing IP addresses).
- 35% of network packets sent targeted decoys, and the number of megabytes (MB) sent to real assets decreased by 25% (based on packet capture (PCAP) data).
- There was a 100% increase in EternalBlue exploit failures and a 65% decrease in successes (based on OCR of attack-client video recordings). The EternalBlue exploit was fairly new at the time of the experiment and was the most prevalent exploit used by participants.
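As a rough illustration of how a metric like the command-targeting percentage can be derived, the sketch below classifies logged commands by the IP addresses they contain. The log lines and address lists are hypothetical, and this is not the study’s actual analysis pipeline.

```python
# Sketch of the command-log analysis behind a metric like "52% of commands
# targeted decoys". The log lines and address lists here are hypothetical.
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DECOY_IPS = {"10.0.0.50", "10.0.0.51"}  # placeholder decoy addresses
REAL_IPS = {"10.0.0.10", "10.0.0.11"}   # placeholder real-asset addresses

def classify_commands(lines):
    """Count commands whose embedded IPs point at decoys vs. real assets."""
    counts = {"decoy": 0, "real": 0, "other": 0}
    for line in lines:
        targets = set(IP_RE.findall(line))
        if not targets:
            continue  # no IP in the command; excluded from the metric
        if targets & DECOY_IPS:   # counting decoy-first is a design choice
            counts["decoy"] += 1
        elif targets & REAL_IPS:
            counts["real"] += 1
        else:
            counts["other"] += 1
    return counts

cmds = ["nmap -sS 10.0.0.50", "ssh admin@10.0.0.10", "ls -la"]
c = classify_commands(cmds)
total = sum(c.values())
print(f"{100 * c['decoy'] / total:.0f}% of IP-bearing commands targeted decoys")
```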
Intrusion Detection System (IDS) alert data suggest that when decoys are present, attackers are easier to detect and real machines are attacked less often (a measurement sketch follows this list):
- 42% more IDS alerts per person were triggered in Present conditions (based on Snort alerts).
- When decoys were present, 30% more IDS alerts were triggered on decoys than real assets, and the number of alerts triggered on real assets was reduced by 44% per person due to effort wasted on decoys (based on Snort alerts).
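For readers who want to reproduce this style of measurement on their own networks, the following sketch counts alerts against decoy versus real assets, assuming Snort’s single-line “fast” alert format; the sample alert text and decoy address list are hypothetical.

```python
# Per-target IDS alert counting, assuming Snort's one-line "fast" alert
# format; the sample alert and decoy address list below are hypothetical.
DECOY_IPS = {"10.0.0.50", "10.0.0.51"}

def dest_ip(alert_line: str) -> str | None:
    """Pull the destination IP out of a Snort fast-format alert line."""
    if " -> " not in alert_line:
        return None
    dst = alert_line.rsplit(" -> ", 1)[1].strip()
    return dst.split(":")[0]  # drop the port, if present

def count_alerts(lines):
    counts = {"decoy": 0, "real": 0}
    for line in lines:
        ip = dest_ip(line)
        if ip is not None:
            counts["decoy" if ip in DECOY_IPS else "real"] += 1
    return counts

alerts = [
    "08/28-12:00:01.000000  [**] [1:1000001:0] SMB probe [**] "
    "[Priority: 2] {TCP} 10.0.0.5:4444 -> 10.0.0.50:445",
]
print(count_alerts(alerts))  # {'decoy': 1, 'real': 0}
```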
The combination of the presence of, and information about, deception had the greatest effect on cyber behavior:
- On average, informed decoy-present (PI) participants misidentified 45% more assets (i.e., decoy as real, real as decoy, incorrect operating system) than uninformed decoy-present (PU) participants.
- On average, twice as many decoys were targeted in the informed decoy-present (PI) condition, than in the uninformed decoy-present (PU) condition.
- More participants attempted to leverage stolen credentials in the control condition (53%; AU) than in the informed/decoy-present (25%; PI) condition.
The impact of cyber deception has been replicated in other controlled experiments, such as the Moonraker Study, which investigated a variety of host-based deception techniques [17]. When deception was present:
- Significantly fewer participants successfully completed the cyber task, demonstrating deception was a disruption of overall success.
- The proportion of successful TTP commands was significantly less, demonstrating deception impeded forward progress.
- For participants who did complete the task, those in the deception-present condition took significantly more time than those in the deception-absent condition, demonstrating the delay caused by deception.
- Participants spent significantly more of their time attempting to connect to decoys (3.5 times the number of minutes spent on real hosts).
Participant Recruitment and Population
Recruitment Process
Participants were recruited via a contracting process that solicited bids from U.S. companies for qualified experts to participate in a red teaming exercise. The study participants were compensated for their work directly or via subcontracts through their affiliated companies. Upon arriving at the study, participants were asked whether they would also like to be part of a human subject research (HSR) study as part of the cyber exercise. Those who opted in provided physiological, demographic, experience, and cognitive data in addition to the network penetration task and task-specific questionnaires (for details, see [11]), and were offered a $25 Amazon gift card for their participation. Over 95% of participants opted into the HSR portion; only six did not volunteer.
Approval for the experimental design was received from all relevant institutional review boards (IRBs). The IRBs determined that the portion of the tasks that aligned with normal red team activity was not HSR and thus could be included in the contracted work. However, the portion that collected data about the participants, their cognition, and their physiology was HSR and thus completely voluntary. No personally identifiable information (PII) was collected, and all experimental data were anonymized. No cyber task performance or HSR information was provided back to any of the participants’ employers. As this research moves forward with the design of new experiments, it is unknown whether those included in previous studies can or should be excluded from future HSR on the same topic. Excluding previous participants from future experiments may be difficult, since all previous participation was anonymized and any pre-screening questions could inadvertently alert potential participants to the topic of the study.
Participant Demographics and Experience
Using expert participants was a pivotal part of these experiments. Hiring cyber expert participants to investigate research questions or evaluate novel tools is a costly yet crucial endeavor. (Valuable outcomes certainly also come from other kinds of research studies, which I do not have space to discuss in this article.) Most previous experiments in the literature used non-experts, such as students or Mechanical Turk workers. The Tularosa participants had an average of 8 years of experience in cyber security, with an average of 5 years in network reconnaissance and 4 years in both network and host penetration. The participant population was:
- under 50 years old (89%)
- held a bachelor’s degree or higher (65%)
- male (94%)
- English speaking (95%).
Over 70% reported that they typically worked in teams (44% reported a typical team size of 2-3; 27% reported a typical team size of 4 or more). Only 12% listed their typical duration of engagement as 1-2 days, while over 50% reported a week or more (14% responded 3-7 days; 22% responded 1-2 weeks; 18% responded 2 weeks to 1 month; 18% responded more than 1 month).
Personality and Decision-Making
Based on the personality and decision-making trait surveys administered, the average participant, when compared to the general population, was:
- more rational
- less avoidant
- less spontaneous
- less indecisive
- higher in need for cognition (i.e., pursuit of difficult problems; enjoying the process of thinking)
- higher in agreeableness (i.e., predilection towards trust and compliance)
- higher in conscientiousness (i.e., level of efficiency and organization)
- lower in neuroticism (i.e., an irritable, unhappy disposition).
Limited research is available that focuses on the decision making of cyber operators, let alone malicious attackers and cyber criminals; such data is notoriously difficult to collect [18]. While white hat hackers, such as red teamers, are a distinct population from malicious attackers, many novel insights have been gained from the experiments discussed in this article [15, 16]. Human decision-making is a critical but often overlooked component of cybersecurity, and I foresee this area of research growing in the near future.
Limitations of the Tularosa Study have been discussed in previous publications [4, 5, 8]. One of the primary limitations was the restricted time frame participants had to recon and attack the network. While a necessary design feature, due to funding and other logistics, it potentially impacts many aspects of the results. For example, few of the participants correctly identified a decoy in the 2-day time period, and none appeared to use that information for counter-deception or to avoid interacting with future decoys. Furthermore, the time constraint likely pushed many participants to behave more aggressively and/or less cautiously than they otherwise would have. Additionally, unbeknownst to the participants, there were no live defenders on the network and no repercussions for the attackers’ activity, regardless of how noisy it was. While it will be important for future work to examine cyber deception in conjunction with live defenses, defensive responses must be tightly controlled to reduce variation, so that the impact of the defensive actions on attacker activity (beyond the impact of the deception itself) can be understood.
Red Teamers
There are many different types of cyber experts, and different terminology has been used to describe them (e.g., white/grey/black hat hackers; red/blue/purple teams; advanced persistent threats (APTs)). The experts who participated in these experiments were professionals who regularly perform network penetration, or similar jobs, to help improve the security of various networks. Penetration testers and the like differ greatly from unauthorized, criminal hackers in motivation, but not necessarily in skill and expertise. While it is clear that these populations have some similarities and differences, the specific variations in personality, behavior, and perception between a red team or ethical hacking population and those who perform unauthorized hacking are not well understood or well studied. Potentially differing characteristics that have been discussed include trust in authority, rule-breaking attempts, and destructive cyber behavior. Furthermore, there are even notable differences between APTs and other cyber criminals.
Cultural Differences
Another critical limitation is the use of only U.S. red teamers. While this was the closest available population to the desired population (malicious cyber attackers), it is unknown how results may differ for unauthorized hackers or for hackers from other cultures. Although it seems certain that differences exist (e.g., in predilection towards trust and compliance), specific data is not available to the research community.
Excluded Data
As with any large experiment, other issues arose during testing that required us to discard some participant data. Fortunately, at most 10 participants were run at a time, and we were able to take steps to ensure these issues did not recur. Excluded data resulted from errors in data collection, errors in experimental set-up, one participant not meeting the qualifications, and a few participants not returning for day two. The slow data collection process, compounded by the need for participants to travel to the test location, resulted in data collection spanning over a year. Additional factors identified after the experiment was complete, but not affecting data collection, included varied interpretation of the cyber task instructions, potential cross-talk among participants during the lunch break, and deliberate attempts to break the rules of engagement.
Data Not Yet Analyzed
The Tularosa Study included two distinct days of red teamers attacking a network. On the second day, each participant was placed in a different experimental condition and interacted with a new, but (perhaps too) similar, network. This was done to investigate how the effects of psychological and cyber deception persist over time. For example, if an attacker encounters decoys in one network, will that impact their behavior in a different network during a new campaign? The Day 2 data, as well as the physiological data collected, are still being analyzed. Even without the full completion of that analysis, there are several aspects of the study we are interested in replicating in order to see how behavior and opinions surrounding cyber deception have evolved in the years since the Tularosa data was collected.
Based on my involvement with various experiments, analysis, and discussions with researchers and customers regarding cyber deception, I am in a position to offer the following recommendations to the community:
- More HSR is needed to better understand the human component of cyber attack and defense. These results can be used to improve the reasoning and decision making of attack/defense models and ensure they better account for realistic human behavior.
- Thought and effort need to be put into designing and instrumenting CTFs and similar events to collect the kinds of data discussed in this article. Much of this data is currently being lost, when it can provide the community with updated, periodic data for a variety of cyber expert skills. More care needs to be taken to support internal and external validity for these events, where possible, and ethical considerations and proper approvals must be in place.
- Be thoughtful about the schedule for running participants during an experiment. Too many participants at once risks the loss or exclusion of large amounts of data. However, a cyber experiment run over too long a time range risks technology, behaviors, and attitudes shifting between the first and last participant’s data collection, given the rate of change in the cyber domain.
- Define the role a participant needs to play during an experiment, with enough specifics provided for them to play it. Understand that certain roles will require specific knowledge and skills. (For Tularosa we attempted to recruit professionals who are trained to, and have demonstrated the ability to, emulate attackers.) If that is not possible, use a pre-test to establish that each participant’s skill level allows them to accurately play the assigned role. You can ask red teamers to act the part of a malicious attacker, but over-specifying how to act like an attacker (e.g., be very stealthy) can undermine the goal of measuring and understanding realistic behavior. When piloting task instructions, be sure to include participants with the range of skill levels and cyber specialties expected of the study participants.
- Although cyber deception techniques will impede attackers and assist defenders, they are not a one-size-fits-all solution. The best solution to employ depends on many factors that must be evaluated for each customer, including:
- Defender resources, i.e., budget, staffing, processes, technological capabilities
- Defender’s risk acceptance
- Domain, i.e., size, appearance, and technology refresh rate of the network
- Criticality of what is to be defended
- Cyber expertise, i.e., ability to manage deception system and respond to alerts
- Attacker’s knowledge of the defender
- Defender’s knowledge of the attacker
Simple cyber deception techniques, such as the low-fidelity, low-interaction decoys used in the Tularosa Study, can impede cyber attackers and promote cyber defense at low cost and with little effort. Decoy systems are a defensive tool that should be recommended as a supplement to traditional cybersecurity tools for improved attack detection and network protection. High-fidelity, more complex deception tools likely provide extra benefits, but they may not be necessary for defending the average network, and the “effort versus reward” equation is still unknown. While there are a variety of commercially available cyber deception systems, including decoy systems, there is little publicly available evidence demonstrating the effectiveness of each product. Perhaps this is because industry sees little benefit, and potential harm to sales, in publicly baselining the performance of its tools; but the largest reason companies don’t calculate or provide this kind of information is simply that customers do not demand it. There are some open-source options [12, 13] for those who want to test the waters, but these put a larger development and maintenance burden on the defenders. There is currently no standardized test or evaluation criteria by which cyber deception tools can be judged. These gaps may be limiting the adoption of tools that have been shown to be highly effective, low-cost solutions, and many aspects critical to deciding which deception technologies to deploy, and how best to deploy them, remain open questions.
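For defenders who want a sense of just how low the barrier to entry can be, the sketch below shows a hypothetical honeytoken in the same spirit as the open-source options cited above (it is not one of those tools): a unique “canary” URL is planted in a tempting document, bookmark, or credential file, and because no legitimate workflow ever fetches it, any request for it is a high-confidence indicator that the bait was taken.

```python
# Hypothetical "canary URL" honeytoken sketch (not one of the cited tools).
# Embed the unique token URL in bait; no legitimate workflow ever fetches
# it, so any hit is a high-confidence indicator the bait was taken.
import secrets
from http.server import BaseHTTPRequestHandler, HTTPServer

TOKEN = secrets.token_hex(16)  # unique per planted document

class CanaryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == f"/c/{TOKEN}":
            print(f"[ALERT] canary token fetched by {self.client_address[0]} "
                  f"(UA: {self.headers.get('User-Agent', '?')})")
        self.send_response(404)  # always look like nothing is here
        self.end_headers()

    def log_message(self, *args):  # silence default request logging
        pass

if __name__ == "__main__":
    print(f"Plant this URL in bait: http://deception.example/c/{TOKEN}")
    HTTPServer(("0.0.0.0", 8080), CanaryHandler).serve_forever()
```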
Cyber defenders are constantly seeking tools that will provide them some advantage over cyber attackers. Cyber deception is one way to increase the cost of attempting a cyber attack. Delaying attacker progress, increasing the risk of detection, and reducing the chance that exploits will succeed are all ways that cyber deception can impose a cyber penalty on attackers. There is no such thing as a fully protected system, and denying all access may not be possible. Techniques that introduce cyber penalties to attackers, such as cyber deception, are an essential component of the cyber defense arsenal.