S. E. Coull, M. P. Collins, C. V. Wright, F. Monrose, M. K. Reiter
Johns Hopkins University and Carnegie Mellon University
{coulls,cwright,fabian}@cs.jhu.edu, mcollins@cert.org, reiter@cmu.edu
Our goal in this paper is to evaluate the strength of current anonymization methodology in achieving goal (ii). Specifically, we focus on providing a realistic assessment of the feasibility of identifying individual web pages within anonymized NetFlow logs [4]. Our work distinguishes itself from prior work by operating on flow-level data rather than packet traces, and by carefully examining many of the practical concerns associated with implementing such identification within real network data. Previous work has focused on methods for web page identification within encrypted or anonymized packet trace data utilizing various packet-level features, such as size information, which cannot be readily scaled to flow-level data. Rather than assume the presence of packet-level information, our work instead focuses on the use of flow-level data from NetFlow logs to perform similar identification. Since NetFlow data contains a small subset of the features provided in packet traces, we are able to provide a general method for identifying web pages within both packet trace and NetFlow data. Also, use of NetFlow data is becoming more commonplace in network and security research [13,21,33,5].
More importantly, our primary contribution is a rigorous experimental evaluation of the threat that web page identification poses to anonymized data. Though previous work has provided evidence that such identification is a threat, these evaluations do not take into account several significant issues (e.g., dynamic web pages, browser caching, web session parsing, HTTP pipelining) involved with the application of deanonymizing techniques in practice. To overcome these obstacles to practical identification of web pages, we apply machine learning techniques to accommodate variations in web page download behavior. (Though machine learning techniques are certainly not the only method for handling variability in web pages, their application in this context seems intuitive.) Furthermore, our techniques can parse and identify web pages even within multiple interleaved flows, such as those created by tabbed browsing, with no additional information. The crux of our identification method lies in modeling the web servers that participate in the download of a web page, and using those models to find the corresponding servers within anonymized NetFlow data. Since the behavior of each server, in terms of the flows it serves, is so dynamic, we apply kernel density estimation techniques to build models that allow for appropriate variations in behavior.
Simply finding web servers is not enough to accurately identify web pages, however. Information such as the order in which the servers are contacted, and which servers are present can have significant impact on the identification of web pages. In fact, the ordering and presence of these servers may change based on various download scenarios, such as changes in browser cache or dynamic web page content. To capture these behaviors, we formalize the game of ``20 Questions'' as a binary Bayes belief network, wherein questions are asked to narrow the possible download scenarios that could explain the presence of a web page within the anonymized data. As such, our approach to web page identification begins with identifying likely servers and then employs the binary Bayes belief network to determine if those servers appropriately explain the presence of the targeted web page within the data.
Lastly, the evaluation of our techniques attempts to juxtapose the assumptions of closed world scenarios used in previous work to the realities of identifying web pages in live network data. The closed world evaluation of data collected through automated browsing scripts within a controlled environment was found to perform well -- detecting approximately 50% of the targeted web pages with less than 0.2% false detections. In more realistic scenarios, however, true detection and false detection rates varied substantially based upon the type of web page being identified. Our evaluation of data taken through controlled experiments and live network captures shows that certain types of web pages are easily identifiable in real network data, while others maintain anonymity due to false detections or poor true detection rates. Additionally, we show the effects of locality (i.e., different networks for collecting training and testing data) on the detection of web pages by examining three distinct datasets taken from disparate network environments. In general, our results show that information leakage from anonymized flow logs poses a threat to web browsing privacy insofar as an attacker is able to approximate the basic browser settings and network conditions under which the pages were originally downloaded.
Network trace anonymization is an active area of research in the security community, as evidenced by the ongoing development of anonymization methods (e.g., [9,23,30]) and releases of network data that they enable (e.g., [26,7]). Recently, several attacks have been developed that illustrate weaknesses in the privacy afforded by these anonymization techniques. In particular, both passive [6] and active attacks [2,3] have shown that deanonymization of public servers and recovery of network topology information is possible in some cases. Until now, however, an in-depth examination of the extent to which the privacy of web browsing activities may also be at risk has been absent.
It would appear that existing approaches for inferring web browsing activities within encrypted tunnels [19,32,11,1,18,8] would be directly applicable to the case of anonymized network data -- in both cases, payload and identifying information (e.g., IP addresses) for web sites are obfuscated or otherwise removed. These prior works, however, assume some method for unambiguously identifying the connections that constitute a web page retrieval. Unfortunately, as we show later, this assumption substantially underestimates the difficulty of the problem: it is often nontrivial to unambiguously delineate the flows that constitute a single page retrieval, and the use of NetFlow data only exacerbates this difficulty. Furthermore, several challenges associated with the modern web environment further complicate web page identification under realistic scenarios.
To our knowledge, Koukis et al. [14] present the only study of web browsing behavior inference within anonymized packet traces, which anticipates some of the challenges outlined herein. In their work, however, the authors address the challenges of parsing web page downloads from packet traces by using packet inter-arrival times to delineate complete sessions. Though this delineation can be successful in certain instances, there are several cases where time-based delineation alone will not work (e.g., for interleaved browsing). In this paper, we address several challenges beyond those considered by Koukis et al. and provide a more in-depth evaluation that goes further than their exploratory work. Moreover, our work differs from all prior work on this problem (of which we are aware) in that it applies to flow traces, which offer far coarser information than packet traces.
The anonymized NetFlow data we consider consists of a time-ordered sequence of records, where each record summarizes the packets sent from the server to the client within a TCP connection. These unidirectional flow records contain the source (server) and destination (client) IP addresses, the source and destination port numbers, timestamps that describe the start and end of each TCP connection, and the total size of the traffic sent from the source to the destination in the flow (in bytes). The NetFlow format also contains a number of other fields that are not utilized in this work. For our purposes, we assume that the anonymization of the NetFlow log creates consistent pseudonyms, such as those created by prefix-preserving anonymization schemes [9,23], for both the source and destination IP addresses in these records. Furthermore, we assume that the NetFlow data faithfully records TCP traffic in its entirety.
The use of consistent pseudonym addresses allows us to separate the connections initiated from different hosts, thereby facilitating per-host examination. Additionally, we assume that port numbers and sizing information are not obfuscated or otherwise altered to take on inconsistent values, since such information is of substantial value for networking research (e.g., [10,29,12]). The unaltered port numbers within the flows allow us to filter the flow records such that only those flows originating from port 80 are examined. (Note that even if this assumption did not hold, there are still techniques that can be used to infer the presence of HTTP traffic, e.g., based on traffic-mix characteristics.)
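To make these preprocessing assumptions concrete, the following is a minimal Python sketch (with illustrative field and function names, not the authors' implementation) of filtering an anonymized flow log down to port-80 traffic and grouping the remaining records by client pseudonym:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FlowRecord:
    """One unidirectional (server-to-client) NetFlow record; field names are illustrative."""
    src_ip: str      # pseudonymized server address
    dst_ip: str      # pseudonymized client address
    src_port: int
    dst_port: int
    start_time: float
    end_time: float
    num_bytes: int   # total bytes sent from server to client

def http_flows_by_client(records):
    """Keep flows originating from port 80 and group them by client pseudonym,
    preserving the time order of the log."""
    by_client = defaultdict(list)
    for rec in records:
        if rec.src_port == 80:
            by_client[rec.dst_ip].append(rec)
    return by_client
```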
Initially, we also assume that web browsing sessions (i.e., all flows that make up the complete download of a web page) can be adequately parsed from the NetFlow log. A similar assumption is made by Sun et al. [32] and Liberatore et al. [18]. Though previous work has assumed that web browsing session parsing algorithms are available, accurate web session parsing is, in fact, difficult even with packet traces and access to payload information [31,15]. In §6, we return to the difficulty of parsing these sessions from real anonymized network data. By adopting the assumption (for now) that accurate web browsing session parsing can be done, it becomes possible to parse the complete NetFlow data into non-overlapping subsequences of flow records, where each subsequence represents a single, complete web browsing session for a client. Given these parsed client web browsing sessions, our goal is to extract features that uniquely identify the presence of target web pages within the anonymized NetFlow data, and to model their behavior in a manner that captures realistic browsing constraints.
An important observation regarding this inconsistency is that the size of any flow is regulated by the cumulative size of all the objects downloaded for the web page, less the size of all objects downloaded in prior flows. If a large flow early in the browsing session retrieves a significant number of objects, then the subsequent flows must necessarily become smaller, or there must be fewer flows overall. Conversely, a session made up of many small flows must necessarily contain more flows overall to reach the same cumulative size. In fact, if we examine the cumulative perspective of web page downloads in Figure 1(b), we find that not only are these sites distinguishable, but that they take consistent paths toward their target cumulative size.
The existence of such paths and the inherent connection between flow size, index number, and cumulative size motivates the use of all three features in identifying web pages. These three features can be plotted in 3-dimensional space, as shown in Figure 2(a), and the path taken in this 3-dimensional space indicates the behavior exhibited by the download of objects for a complete web browsing session. Figure 2(b) shows an example of web browsing session paths for the front pages of both yahoo.com and msn.com overlaid on the set of points taken over many web browsing sessions of msn.com's front page. Clearly, the path taken by yahoo.com is distinct from the set of points generated from web browsing sessions of msn.com, while the msn.com path remains similar to past web browsing sessions.
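As an illustration, the following short Python sketch (ours, not the paper's code) derives the path traced in this 3-dimensional space from the ordered byte sizes of a single server's flows:

```python
def server_path(flow_sizes_bytes):
    """Given the byte sizes of one server's flows in the order they were observed,
    return the path of (flow index, flow size, cumulative size) points traced
    out in the 3-dimensional feature space."""
    path, cumulative = [], 0
    for index, size in enumerate(flow_sizes_bytes, start=1):
        cumulative += size
        path.append((index, size, cumulative))
    return path

# Example: three flows of 12 kB, 3 kB, and 40 kB from one server
print(server_path([12000, 3000, 40000]))
# [(1, 12000, 12000), (2, 3000, 15000), (3, 40000, 55000)]
```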
(Figure panels: (a) msn.com Server 1, (b) msn.com Server 2, (c) msn.com Server 3)
Of course, the creation of robust models for the detection of web pages requires that the data used to create the models reflect realistic behaviors of the logical servers and the order in which they are contacted. There are a number of considerations which may affect the ability of data to accurately predict the behavior of web page downloads. These considerations are especially important when an attacker is unable to gain access to the same network where the data was collected, or when that data is several months old. Liberatore et al. have shown that the behavioral profiles of web pages, even highly dynamic web pages, remain relatively consistent even after several months [18], though the effects of web browser caching behavior and the location where the network data was captured have not yet been well understood.
There are several steps, illustrated in Figure 4, that must be performed on the anonymized NetFlow logs in order to accurately identify web pages within them. Our first step is to take the original NetFlow log and parse the flow records it contains into a set of web browsing sessions for each client in the log. Recall that our initial discussion assumes the existence of an efficient and accurate algorithm for parsing these web browsing sessions from anonymized NetFlow logs. These web browsing sessions, by definition, consist of one or more physical server sessions, which are trivially parsed by partitioning the flow records for each (client, server) pair into separate physical server sessions. Each physical server session represents the path taken within the 3-dimensional space (i.e., flow size, cumulative size, and index triples) when downloading objects from the given physical server. At this point, we take the path defined by each physical server in the web browsing session and determine which of the logical servers in our classifier it is most similar to, using kernel density estimates [28]. Thus, a given physical server is mapped to one or more logical servers based on its observed behavior. This mapping indicates which logical servers may be present within the web browsing session, and we can characterize the identity of a web page by examining the order in which the logical servers were contacted using a binary belief network. If we can satisfy the constraints of our classifier based upon the logical servers present within the web browsing session, then we hypothesize that an instance of the web page has been found. In §4.1 and §4.2, we discuss how the kernel density estimates and binary belief networks are created, respectively.
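The overall pipeline can be summarized by the following Python sketch; here the per-logical-server KDE models and the belief network are passed in as simple predicates, standing in for the components detailed in §4.1 and §4.2:

```python
def identify_web_page(physical_server_paths, kde_models, belief_network):
    """Sketch of the identification pipeline for one parsed web browsing session.
    `physical_server_paths` maps a pseudonymized server address to its path of
    (index, flow size, cumulative size) points; `kde_models` maps a logical
    server name to a predicate over paths; `belief_network` is a predicate over
    the resulting logical-server mapping. All of these are illustrative
    stand-ins, not the authors' implementation."""
    mapping = {}
    for server, path in physical_server_paths.items():
        # Map each physical server to the logical servers whose KDE models accept it.
        matched = [name for name, accepts in kde_models.items() if accepts(path)]
        if matched:
            mapping[server] = matched
    # Traverse the belief network over the existence/ordering of mapped logical
    # servers; True means the target page plausibly explains the observed flows.
    return belief_network(mapping)
```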
In general, the kernel density estimate (KDE) [28] uses a vector of $n$ samples, $\vec{s} = (s_1, \ldots, s_n)$, to derive an estimate for a density function describing the placement of points in some $d$-dimensional space. To construct a KDE for a set of samples, we place $n$ individual probability distributions, or kernels, centered at each sample point $s_i$. In the case of Gaussian kernels, for instance, there would be $n$ Gaussian distributions with means of $s_1$, $s_2$, ..., $s_n$, respectively. To control the area covered by each distribution, we can vary the so-called bandwidth of the kernel. For a Gaussian kernel, its bandwidth is given by the variance (or covariance matrix) of the distribution. Intuitively, a higher bandwidth spreads the probability mass out more evenly over a larger space.
Unfortunately, determining the appropriate bandwidth for a given set of data points is an open problem. One approach that we have found to produce acceptable results is to use a ``rule of thumb'' developed by Silverman [28] and refined by Scott [27]. The bandwidth is calculated as $h_j = \sigma_j\, n^{-1/(d+4)}$ for each dimension $j$, where $n$ is the number of kernels, $d$ is the number of dimensions for each point, and $\sigma_j$ is the sample standard deviation of the $j^{\mathrm{th}}$ dimension over the sample points in $\vec{s}$. The primary failing of this heuristic is its inability to provide flexibility for multi-modal or irregular distributions. However, since this heuristic provides adequate results for the problem at hand, we forego more complex solutions at this time.
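As a concrete illustration, the bandwidth rule of thumb can be computed as in the following Python sketch; it uses Scott's rule as commonly stated, which may differ in constant factors from the authors' exact implementation:

```python
import numpy as np

def scott_bandwidths(samples):
    """Per-dimension rule-of-thumb bandwidths for a Gaussian-kernel KDE.
    `samples` is an (n, d) array of sample points; Scott's rule sets the
    bandwidth of dimension j to sigma_j * n**(-1/(d+4)), where sigma_j is the
    sample standard deviation of that dimension."""
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    sigma = samples.std(axis=0, ddof=1)
    return sigma * n ** (-1.0 / (d + 4))
```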
Once the distributions and their associated bandwidths have been placed in the 3-dimensional space, we can calculate the probability of a given point $x$ under the KDE model as:

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{1}{h_j}\, K\!\left(\frac{x_j - s_{i,j}}{h_j}\right) \qquad (1)$$

where $K$ is the kernel function (e.g., the standard Gaussian density), $s_{i,j}$ is the $j^{\mathrm{th}}$ coordinate of sample $s_i$, and $h_j$ is the bandwidth for dimension $j$.
To evaluate an anonymized physical server session on a particular KDE model, we simply evaluate each point in the path for that physical server session using Eqn. 1, and calculate the total probability of the given physical server session as the product of the probabilities of its points:

$$P(\mathit{path}) = \prod_{x \in \mathit{path}} f(x) \qquad (2)$$
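The following is a minimal Python sketch of Eqns. 1 and 2 as reconstructed above, assuming product-form Gaussian kernels with per-dimension bandwidths:

```python
import numpy as np

def kde_density(point, samples, bandwidths):
    """Eqn. 1 (as reconstructed above): average of product-form Gaussian kernels
    centered at each training sample, evaluated at `point`.
    `samples` has shape (n, d); `bandwidths` has length d (one per dimension)."""
    samples = np.asarray(samples, dtype=float)
    point = np.asarray(point, dtype=float)
    h = np.asarray(bandwidths, dtype=float)
    z = (point - samples) / h                      # (n, d) standardized offsets
    kernels = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2 * np.pi))
    return float(np.mean(np.prod(kernels, axis=1)))  # average over the n kernels

def path_probability(path, samples, bandwidths):
    """Eqn. 2 (as reconstructed above): product of the densities of the points
    along a physical server session's path."""
    return float(np.prod([kde_density(p, samples, bandwidths) for p in path]))
```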
The use of path probabilities alone, however, is insufficient to uniquely describe the behavior of a logical server. To see why, consider the case where we have a model for a logical server that typically contains ten or more total flows. It may be possible for a much smaller physical server session with one or two flows to achieve a non-zero probability despite the fact that there clearly was not an adequate amount of data transferred. To address this, we also create a KDE model for the final point in each sample path during training, denoted as end points. These end points indicate the requisite cumulative size and number of flows for a complete session with the given logical server. As before, we create distributions around each sample end point, and calculate the probability of a physical server session's end point by applying Eqn. 1. Any anonymized physical server session that has non-zero probability on both its path and its end point for a given logical server model is mapped to that logical server.
To address this, we apply a second heuristic that merges these remaining physical servers by examining the behavior exhibited by their KDE models. If a randomly selected path and its end point from a given physical server's training data achieve non-zero probability on the KDE model of another physical server, then those two physical servers can be merged into a single logical server. The combination of these two heuristics allows us to reliably create KDE models that represent the logical servers found in the web browsing session. By applying the points found in an anonymized physical server session to each of the KDE models for a given web page, we can create candidate mappings from the anonymized physical server to the logical servers of the target web page.
As discussed earlier, we formalize the constraints on the logical servers using a binary Bayes belief network (BBN). In a typical Bayes network, nodes represent events and are inter-connected by directed edges that depict causal relations. The likelihood that a given event occurs is given by the node's probability, which is conditioned on its ancestors. In the binary variant we apply here, events take boolean values and the causal edges are derived from these values [20].
An example of a binary belief network is given in Figure 5, where the probability of one event is conditioned upon the outcome of another. One way of thinking of this network is as a strategy for the game of ``20 Questions,'' where the player attempts to identify an object or person by asking questions that can only be answered with `Yes' or `No' responses. Our binary belief network is simply a formalization of this concept (though we are not limited to asking 20 questions), where the answer to any question dictates the best strategy for asking future questions.
To create the belief network, we first decide upon a set of questions (or events) that we would like to evaluate within the data. In the context of web privacy, these events relate to the existence of logical servers, their causal relationships, and cumulative size. The belief network can be created automatically by first examining all possible existence and ordering events that occur within the training data. Next, from this set of events, we simply select the event whose probability of being True in the training data is highest among all events. Having done so, the training data can then be partitioned into two groups: one group whose data has the value True for the selected event, and another whose value is False for that event. The selected event is then removed from the set of possible events, and each partition of training data now selects another event from the remaining set whose probability on its respective partition is highest.
This partitioning process is repeated recursively, allowing each branch to grow independently. A given branch halts its recursion when its conditional probability for an event falls below a threshold. This conditional probability threshold corresponds to the fraction of the training data that remains at a given leaf node, and therefore we stop our recursion before the tree becomes overfitted to our training data. Any leaf node that halts recursion with some amount of training data remaining is considered an accepting node, and all other leaf nodes are labeled as rejecting nodes. Accepting nodes implement one additional check to ensure that the total size of all flows in the web browsing session is within a small tolerance of the total sizes observed during training.
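The recursive construction just described can be sketched in Python as follows; the event definitions and stopping threshold are illustrative placeholders rather than the paper's exact choices:

```python
def build_belief_tree(sessions, events, min_fraction, total=None):
    """Greedy, recursive construction of the binary belief tree.
    `sessions` is a list of training web browsing sessions; `events` is a list
    of (name, predicate) pairs over sessions (e.g., "logical server A present",
    "A contacted before B"). A branch stops recursing once the fraction of the
    original training data remaining at it drops below `min_fraction`; leaves
    that still hold training data are accepting."""
    total = total if total is not None else len(sessions)
    if len(sessions) / total < min_fraction or not events:
        return {'leaf': True, 'accepting': len(sessions) > 0}
    # Select the event most often True on the training data reaching this node.
    name, pred = max(events, key=lambda e: sum(e[1](s) for s in sessions))
    true_part = [s for s in sessions if pred(s)]
    false_part = [s for s in sessions if not pred(s)]
    remaining = [e for e in events if e[0] != name]
    return {'leaf': False, 'event': name,
            'true': build_belief_tree(true_part, remaining, min_fraction, total),
            'false': build_belief_tree(false_part, remaining, min_fraction, total)}
```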
In short, our initial evaluation is conducted under controlled environments similar to past work, but with two notable differences. First, in the scenarios we examine, there is substantially less data available to us than at the packet trace level; recall that NetFlows aggregate all packets in a flow into a single record. Second, rather than assuming that the client's browser cache is turned off, we attempt to simulate the use of caching in browsers in our training and testing data. The simulation of browser caching behavior was implemented by enabling the default caching and cookie policies within Mozilla Firefox, and browsing to the sites in our target set at random. Of course, this method of cache simulation is not entirely realistic, as the probability of a cache hit is directly proportional to the frequency with which the user browses that web site. However, in lieu of making any assumptions on the distribution of web browsing for a given user, we argue that for the comparison at hand, uniform random web browsing behavior provides an adequate approximation.
Our target pages were the front pages of the top 50 most popular sites as ranked by alexa.com. We also collected information about the front pages of sites ranked 51-100 on the alexa.com list to provide a robust evaluation of the false detection rates of our technique. Though we have chosen to evaluate our techniques on the top 100 sites, there is nothing inherent in their structure that differentiates them from other web pages. In fact, the same techniques are equally applicable in targeting any web page of the attacker's choosing.
The web pages were retrieved by running the automated browsing script on a host within the Johns Hopkins University network for a total of four weeks, creating a total of 18,525 web browsing sessions across all 100 web pages in our list of web pages. From this data, we select the first 90 web browsing sessions of our target web pages (i.e., those within the top 50 of the alexa.com ranking) as the training data for the creation of the kernel density estimate (see § 4.1) and binary Bayes belief network models (see § 4.2) that make up the profiles for each target web page. The remaining sessions are used as test data and are anonymized by replacing IP addresses within the NetFlow data with prefix-preserving pseudonyms according to the techniques described by [23]. Notice that since we assume that the web browsing sessions are easily parsed, we can simply use each web browsing session in our test data directly to determine if that web browsing session can be identified as any of the 50 web pages in our target set using the techniques described in § 3.
For ease of exposition, we also partition the 50 target web pages into canonical categories based on the primary function of the web site. Notice that the performance of the canonical classes varies based on the dynamism of the contents in the web page. For instance, some of the more difficult categories in terms of true detection are those whose front page content changes frequently, e.g., cnn.com. Conversely, pages with simple, static content, like passport.net or google.com, can be identified reasonably well. Moreover, those web sites with simple layouts and little supporting infrastructure also tend to fare worse with respect to false detections, while complex, dynamic sites have few, if any, false detections. These initial results hint at the fact that the ability to reliably identify web pages is connected with the complexity and dynamism of the web page. In what follows, we examine whether these results hold under a more realistic examination based on real world browsing.
Fortunately, our kernel density estimate (KDE) and binary Bayes belief network (BBN) models can be modified to overcome the challenges of web browsing session parsing without significant changes to our identification process. In our previous evaluation, we assumed that the KDE and BBN models were given a subsequence of the original NetFlow log that corresponded to a complete web browsing session for a single client. For our real world evaluation, however, we remove this assumption. Instead, to parse the NetFlow log, we assume that all flows of a given web browsing session are clustered in time, and partition the NetFlow log into subsequences such that the inter-arrival times of consecutive flows within a partition fall below a fixed threshold (in seconds). This assumption is similar to that of Koukis et al. [14] and provides a coarse approximation to the web browsing sessions, but the resultant partitions may contain multiple web browsing sessions, or interleaved sessions.
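A simple Python sketch of this time-based partitioning follows, with the inter-arrival threshold left as a tunable parameter rather than a value prescribed by the paper:

```python
def partition_by_gap(flows, max_gap_seconds):
    """Coarse, time-based session parsing: split a client's time-ordered flow
    records into partitions wherever the gap between consecutive flow start
    times exceeds `max_gap_seconds`. `flows` is a list of (start_time, record)
    pairs sorted by start time."""
    partitions, current = [], []
    last_start = None
    for start_time, record in flows:
        if last_start is not None and start_time - last_start > max_gap_seconds:
            partitions.append(current)
            current = []
        current.append(record)
        last_start = start_time
    if current:
        partitions.append(current)
    return partitions
```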
Notice that we can simply use each of the physical servers within these partitions as input to the KDE models for a target web page to determine which logical servers may be present in the partition. Thus, we apply the flows from every physical server in our partitioned NetFlow data to the KDE models for our target page to create the logical server mappings. If a physical server in our partition does not map to a logical server, we ignore that physical server's flows and remove it from the partition. Thus, by removing these unmapped physical servers, we identify a candidate web browsing session for our target site. Since the BBN operates directly on the mappings created by the KDE models, we traverse the BBN and determine if the web page is present based on the physical servers that were properly mapped. This technique for finding web browsing sessions is particularly robust since it can find multiple web pages within a single partition, even if these web pages have been interleaved by tabbed browsing.
The effects that the changes to our models have on the performance of our technique are shown in Table 2. Clearly, the false detection rate increases substantially, but the true detection rate also increases. As in the closed world scenario, we find that the web pages with constantly changing content are more difficult to detect than static web pages, and that those sites with complex structure (i.e., many logical servers, and many flows) achieve a significantly lower false detection rate than those sites with simple structure. The substantial change in performance can be explained by the relaxation of the BBN constraints to allow for web browsing session parsing. This relaxation allows any web browsing session where a subset of physical servers meets the remaining constraints to be identified, thereby causing the increase in both true detection and false detection rates. A more detailed analysis of the implications of these effects is provided in §7.
It is often the case that published network data is taken at locations where an attacker would not have access to the network to collect training data for her models, and so we investigate the effect that the change in locality has on the performance of our technique. The results in Table 2 show that there is, indeed, a drop in performance due to changes in locality, though trends in true detection and false detection rates still hold. In our evaluation, we noticed that the Johns Hopkins data used to train our web page models included a web caching server that caused significant changes in the download behavior of certain web pages. These changes in behavior in turn explain the significant difference in performance among data collected at different localities. It would appear that these results are somewhat disconcerting for a would-be attacker, since she would have to generate training data at a network that was different from where the anonymized data was captured. However, she could make her training more robust by generating data on a number of networks, perhaps utilizing infrastructure such as PlanetLab [25], though the effects of doing so on the performance of the technique are unknown. By including web page download behavior from a number of networks, she can ensure that the KDE and BBN models for each target web page are robust enough to handle a variety of network infrastructures.
To better understand this difference in our classifier's performance for different web pages, we examine three features of the page loads in our automated browsing sessions for each site: the number of flows per web browsing session, the number of physical servers per web browsing session, and the number of bytes per flow. Our goals lie in understanding (i) why two sites within the same category have wildly different performance, and (ii) why simple web pages introduce so many more false detections than more complex web pages. To quantify the complexity of the web page and the amount of variation exhibited in these features, we compute the mean and standard deviation for each feature across all observations of the web page in our training data, including both simulated cache and unlimited cache scenarios. The mean values for each feature provide an idea of the complexity of the web page. For instance, a small average number of physical servers would indicate that the web page does not make extensive use of content delivery networks. The standard deviations tell us how consistent the structure of the page is. These statistics offer a concise measure of the complexity of each web page, enabling us to objectively compare sites in order to determine why some are more identifiable than others.
Returning to the difference in true detection rates for cnn.com and nytimes.com, the front pages of both sites have constantly changing news items, a significant number of advertisements, and extensive content delivery networks. One would expect that, since both web pages change so rapidly, our accuracy in classifying them would be comparable. Instead, we find that we can detect nearly half of the occurrences of cnn.com in live network data, while we never successfully detect nytimes.com. In Table 4, we see that both web pages have similar means and standard deviations for both flow size and number of physical servers. This similarity is likely due to the nature of the content these two sites provide. However, the number of flows per web browsing session for nytimes.com is nearly double that of cnn.com. Moreover, nytimes.com exhibits high variability in the number of flows it generates, while cnn.com seems to use a fairly stable number of flows in all web browsing sessions. This variability makes it difficult to construct high-quality kernel density estimates for the logical servers that support nytimes.com, so our detector necessarily performs poorly on it.
Another interesting result from our live network evaluation is that some web page models appear to match well with almost all other pages, and therefore cause an excessive amount of false detections. For instance, Table 3 shows that yahoo.com has an exceptionally low false detection rate among all web pages in our live traffic, while google.com has one of the highest false detection rates. Both web pages, however, provide adequate true detection rates. In Table 5, we see that google.com and yahoo.com have very distinct behaviors for each of the features.
The metrics for our three features show that google.com transfers very little data, that there is almost always only one physical server in the web browsing session, and that there are normally only one or two flows. On the other hand, yahoo.com serves significantly more data, has a substantial number of physical servers, and causes the browser to open several flows per web browsing session. The web browsing sessions for google.com and yahoo.com both exhibit very little variability, though yahoo.com has more variability in its flow sizes due to its dynamic nature. Since both web pages have relatively low variability for all three features, they are both fairly easy for our techniques to detect, which corroborates our earlier claim that cnn.com is easy to detect because of the relative stability of its features. However, since google.com is so simplistic, with only a single physical server and very few flows on average, its BBN and KDE models have very few constraints that must be met before the detector flags a match. Hence, many physical servers in a given NetFlow log could easily satisfy these requirements, causing the detector to produce an excessive number of false detections. By contrast, the models for yahoo.com have enough different logical servers and enough flows per session that it is difficult for any other site to fit the full description that is captured in the BBN and KDE models.
When evaluating the threat that our web page identification attack poses to privacy, it is prudent to consider the information an attacker can reliably gain, possible practical countermeasures that might hamper such attacks, and the overarching goals of network data anonymization. With the techniques presented in this paper, an attacker would be able to create profiles for specific web pages of interest, and determine whether or not at least one user has visited that page, as long as those target web pages were of sufficient complexity. While the attacker will not be able to pinpoint which specific user browsed to the page in question with the technique presented in this paper, such information leakage may still be worrisome to some data publishers (e.g., evidence that some user visited several risqué web pages).
There are, however, practical concerns that may affect the attacker's success aside from those described in this paper, such as the use of ad blocking software and web accelerators that dramatically alter the profiles of web pages. These web browsing tools could be used to make the attacker's job of building robust profiles more difficult, as the attacker would not only have to adjust for locality effects, but also for the effects of the particular ad blocking software or web accelerators. Moreover, while our evaluation has provided evidence that certain classes of web pages are identifiable despite the use of anonymization techniques, it is unclear how well the true detection and false detection rates scale with a larger target web page set. Therefore, our techniques appear to be of practical concern insofar as the attacker can approximate the behavior of the browsers and network environment used to download the web page.
In this paper, we perform an in-depth analysis of the threats that publishing anonymized NetFlow traces poses to the privacy of web browsing behaviors. Moreover, we believe our analysis is the first that addresses a number of challenges to uncovering browsing behavior present in real network traffic. These challenges include the effects of network locality on the adversary's ability to build profiles of client browsing behavior; difficulties in unambiguously parsing traffic to identify the flows that constitute a web page retrieval; and the effects of browser caching, content distribution networks, dynamic web pages, and HTTP pipelining. To accommodate these issues, we adapt machine learning techniques to our problem in novel ways.
With regard to realistic threats to anonymized NetFlow data, our results show that there are certain web pages whose behavior is so variable that they may be very difficult to detect in practice. Also, our techniques offer an attacker little hope of accurately identifying small, simple web pages with a low false detection rate. However, complex web pages appear to pose a threat to privacy. Finally, our results show that an attacker must consider the effects of locality on the training data used to create the target web page models.
Our results and analysis seem to contradict the widely held belief that small, static web pages are the easiest target for identification. This contradiction can be explained by the distinct differences between closed world testing and the realities of identifying web pages in the wild, such as browser caching behavior and web browsing session parsing. On the whole, we believe our study shows that a non-trivial amount of information about web browsing behaviors is leaked in anonymized network data. Indeed, our analysis has demonstrated that anonymization offers less privacy to web browsing traffic than once thought, and suggests that a class of web pages can be detected in a flow trace by a determined attacker.