

Empirical results

To support our use of $G_{{\cal S}}^{\rm avg}$, $G_{{\cal S}}^{\rm med}$, and $G_{{\cal S}}^{x}$ as largely stable measures of password quality, we first plot these measures under various instances of Assumption 4.1, i.e., for various values of $\hat{\ell}$ and, for each, a range of values of $\lambda_{\hat{\ell}}$. For example, in the case of $\hat{\ell} = 0$, Figures 3 and 4 show the measures $G_{{\cal S}}^{\rm avg}$, $G_{{\cal S}}^{\rm med}$, $G_{{\cal S}}^{25}$ and $G_{{\cal S}}^{10}$, as well as the guessing entropy as computed in (6), for various values of $\lambda _0$. Figure 3 is for the Face scheme, and Figure 4 is for the Story scheme.

The key point to notice is that each of $G_{{\cal S}}^{\rm avg}$, $G_{{\cal S}}^{\rm med}$, $G_{{\cal S}}^{25}$ and $G_{{\cal S}}^{10}$ is very stable as a function of $\lambda _0$, whereas guessing entropy varies more (particularly for Face). We highlight this fact to reiterate our reasons for adopting $G_{{\cal S}}^{\rm avg}$, $G_{{\cal S}}^{\rm med}$, and $G_{{\cal S}}^{x}$ as our measures of security, and to set aside concerns that particular choices of $\lambda _0$ have heavily influenced our results. Indeed, even for $\hat{\ell} = 1$ (with some degree of back-off to $\hat{\ell} = 0$ as prescribed by (5)), the values of $\lambda _0$ and $\lambda _1$ do not greatly impact our measures. For example, Figures 5 and 6 show $G_{{\cal S}}^{\rm avg}$ and $G_{{\cal S}}^{25}$ for Face. While these surfaces may suggest more variation, note the small range on the vertical axis in Figure 5; the variation is only between 1361 and 1574. This is in contrast to guessing entropy as computed with (6), which varies between 252 and 3191 as $\lambda _0$ and $\lambda _1$ are varied (not shown). Similarly, while $G_{{\cal S}}^{25}$ varies between 24 and 72 (Figure 6), the analogous computation using (5) more directly--i.e., computing the smallest $j$ such that $\sum_{i=1}^{j} \Pr\left[{p_i}^{(k)} \leftarrow {\cal S}\right] \ge .25$--varies between 27 and 1531. In the remainder of the paper, the numbers we report for $G_{{\cal S}}^{\rm avg}$, $G_{{\cal S}}^{\rm med}$, and $G_{{\cal S}}^{x}$ reflect values of $\lambda _0$ and $\lambda _1$ chosen to simultaneously minimize these measures to the extent possible.
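As an illustrative sketch (not the code used in the study), the following shows one way these rank-based measures can be computed once the smoothed model probabilities are fixed. It assumes, consistent with the discussion above, that $G_{{\cal S}}^{\rm avg}$ and $G_{{\cal S}}^{\rm med}$ are the mean and median number of guesses an attacker needs to reach each password in the sample when guessing candidates in order of decreasing model probability, and that $G_{{\cal S}}^{x}$ is the corresponding $x$-th percentile; the names guess_rank_measures, model_probs, and observed are ours.

    import numpy as np

    def guess_rank_measures(model_probs, observed, percentiles=(25, 10)):
        # model_probs: dict mapping each candidate password to its estimated
        #   probability under the lambda-smoothed model of Assumption 4.1.
        # observed: list of passwords actually chosen by the population S.
        # The attacker guesses candidates in order of decreasing model
        # probability; each observed password is scored by its rank, i.e.,
        # the number of guesses needed to reach it.
        order = sorted(model_probs, key=model_probs.get, reverse=True)
        rank = {pw: i + 1 for i, pw in enumerate(order)}      # 1-based ranks
        ranks = np.array([rank[pw] for pw in observed], dtype=float)
        return {
            "G_avg": float(ranks.mean()),          # average guesses over the sample
            "G_med": float(np.median(ranks)),      # median guesses
            # G^x: guesses sufficient to break the weakest x% of the sample
            **{f"G_{x}": float(np.percentile(ranks, x)) for x in percentiles},
        }

On this reading, the measures depend on the model only through the induced guessing order rather than the probability values themselves, which helps explain why they vary so little with $\lambda _0$ and $\lambda _1$.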

Figure 5: $G_{{\cal S}}^{\rm avg}$ versus $\lambda _0$, $\lambda _1$ for Face
[figure: average-bigram-entropy.eps]

Figure 6: $G_{{\cal S}}^{25}$ versus $\lambda _0$, $\lambda _1$ for Face
[figure: 25th-bigram-entropy.eps]

Tables 2 and 3 present results for the Story scheme and the Face scheme, respectively. Populations with fewer than ten passwords are excluded from these tables. These numbers were computed under Assumption 4.1 with $\hat{\ell} = 0$ in the case of Story and $\hat{\ell} = 1$ in the case of Face; $\lambda _0$ and $\lambda _1$ were set as indicated in the table captions. These choices were dictated by our goal of minimizing the various measures we consider ($G_{{\cal S}}^{\rm avg}$, $G_{{\cal S}}^{\rm med}$, $G_{{\cal S}}^{25}$ and $G_{{\cal S}}^{10}$), though, as already demonstrated, these measures are generally not particularly sensitive to the choices of $\lambda _0$ and $\lambda _1$.
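As a hypothetical illustration of this tuning (the grid, the objective, and the helper tune_lambdas are ours, not the paper's), one could sweep candidate values of $\lambda _0$ and $\lambda _1$ and keep the pair that minimizes the measures; since a single pair rarely minimizes all four at once, the sketch below minimizes their sum as a simple compromise, reusing guess_rank_measures from the earlier sketch.

    import itertools

    def tune_lambdas(build_model, observed, lambda0_grid, lambda1_grid):
        # build_model(l0, l1) is assumed to return the dict of smoothed
        # password probabilities for that choice of smoothing parameters.
        best = None
        for l0, l1 in itertools.product(lambda0_grid, lambda1_grid):
            m = guess_rank_measures(build_model(l0, l1), observed)
            score = sum(m.values())                 # crude joint objective
            if best is None or score < best[0]:
                best = (score, (l0, l1), m)
        _, lambdas, measures = best
        return lambdas, measures

    # e.g., lambda0_grid = [2**k for k in range(-4, 5)], and similarly for lambda1_grid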


Table 2: Results for Story, $\lambda _0=2^{-2}$
Population $G_{{\cal S}}^{\rm avg}$ $G_{{\cal S}}^{\rm med}$ $G_{{\cal S}}^{25}$ $G_{{\cal S}}^{10}$
Overall 790 428 112 35
Male 826 404 87 53
Female 989 723 125 98
White Male 844 394 146 76
Asian Male 877 589 155 20



Table 3: Results for Face, $\lambda _0=2^{-2}, \lambda _1=2^2$
Population $G_{{\cal S}}^{\rm avg}$ $G_{{\cal S}}^{\rm med}$ $G_{{\cal S}}^{25}$ $G_{{\cal S}}^{10}$
Overall 1374 469 13 2
Male 1234 218 8 2
Female 2051 1454 255 12
Asian Male 1084 257 21 5.5
Asian Female 973 445 19 5.2
White Male 1260 81 8 1.6


The numbers in these tables should be considered in light of the number of available passwords. Story has $9 \times 8 \times 7 \times 6 = 3024$ possible passwords, yielding a maximum possible guessing entropy of $1513$. Face, on the other hand, has $9^4 = 6561$ possible passwords (for fixed sets of available images), for a maximum guessing entropy of $3281$.
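For reference, these maxima are the guessing entropy of a uniform distribution over the password space: if all $N$ passwords were equally likely, an optimal attacker would need $\sum_{i=1}^{N} i/N = (N+1)/2$ guesses on average, i.e., $(3024+1)/2 \approx 1513$ for Story and $(6561+1)/2 = 3281$ for Face.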

Our results show that for Face, if the user is known to be male, then the worst 10% of passwords can be guessed on the first or second attempt. This observation is sufficiently surprising as to warrant restatement: an online dictionary attack will succeed within merely two guesses against 10% of male users. Similarly, if the user is known to be Asian and his or her gender is also known, then the worst 10% of passwords can be guessed within the first six tries.

It is interesting to note that $G_{{\cal S}}^{\rm avg}$ is always higher than $G_{{\cal S}}^{\rm med}$. This implies that for both schemes there are some well-chosen passwords that significantly increase the average number of guesses an attacker would need to make, but leave the median unaffected. The most dramatic example is white males using the Face scheme, where $G_{{\cal S}}^{\rm avg} = 1260$ whereas $G_{{\cal S}}^{\rm med} = 81$.

These results raise the question of what different populations tend to choose as their passwords. Insight into this for the Face scheme is given in Tables 4 and 5, which characterize selections by gender and race, respectively. As Table 4 shows, both male and female users chose female faces significantly more often than male faces (over 68% of selections by females and over 75% by males), and when males chose female faces, they almost always chose models (roughly 80% of the time). These observations are widely supported by users' remarks in the exit survey, e.g.:

``I chose the images of the ladies which appealed the most.''

``I simply picked the best lookin girl on each page.''

``In order to remember all the pictures for my login (after forgetting my `password' 4 times in a row) I needed to pick pictures I could EASILY remember - kind of the same pitfalls when picking a lettered password. So I chose all pictures of beautiful women. The other option I would have chosen was handsome men, but the women are much more pleasing to look at :)''

``Best looking person among the choices.''

There was also significant correlation among members of the same race. As shown in Table 5, Asian females and white females chose from within their own race roughly 50% of the time; white males chose whites over 60% of the time, and black males chose blacks roughly 90% of the time (though the reader should be warned that there were only three black males in the study, so this number requires further validation). Again, a number of exit surveys confirmed this correlation, e.g.:

``I picked her because she was female and Asian and being female and Asian, I thought I could remember that.''

``I started by deciding to choose faces of people in my own race ... specifically, people that looked at least a little like me. The hope was that knowing this general piece of information about all of the images in my password would make the individual faces easier to remember.''

``... Plus he is African-American like me.''


Table 4: Gender and attractiveness selection in Face.
Pop.     Female Model   Male Model   Typical Female   Typical Male
Female   40.0%          20.0%        28.8%            11.3%
Male     63.2%          10.0%        12.7%            14.0%



Table 5: Race selection in Face.
Pop. Asian Black White  
Asian Female 52.1% 16.7% 31.3%  
Asian Male 34.4% 21.9% 43.8%  
Black Male 8.3% 91.7% 0.0%  
White Female 18.8% 31.3% 50.0%  
White Male 17.6% 20.4% 62.0%  


Insight into which categories of images different genders and races chose in the Story scheme is given in Tables 6 and 7. The most significant differences between males and females (Table 6) are that females chose animals twice as often as males did, and males chose women twice as often as females did. Less pronounced differences are that males tended to select nature and sports images somewhat more than females did, while females tended to select food images more often; however, since these differences were all within four percentage points, it is not clear how significant they are. Few definitive trends by race emerge in the Story scheme (Table 7), particularly considering that the Hispanic data reflects only two users and so should be discounted.
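The percentages in Tables 4-7 are simple selection frequencies over all images picked by each population. As an illustrative sketch (the names category_breakdown, selections, and category_of are hypothetical), such a breakdown can be tabulated as follows.

    from collections import Counter

    def category_breakdown(selections, category_of):
        # selections: dict mapping a population label (e.g., "Asian Female")
        #   to the list of images its members picked across their passwords.
        # category_of: function mapping an image to its category
        #   (gender/attractiveness for Face, image category for Story).
        table = {}
        for population, images in selections.items():
            counts = Counter(category_of(img) for img in images)
            total = sum(counts.values())
            table[population] = {c: 100.0 * n / total for c, n in counts.items()}
        return table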


Table 6: Category selection by gender in Story
Pop. Animals Cars Women Food Children Men Objects Nature Sports  
Female 20.8% 14.6% 6.3% 14.6% 8.3% 4.2% 12.5% 14.6% 4.2%  
Male 10.4% 17.9% 13.6% 11.0% 6.8% 4.6% 11.0% 17.2% 7.5%  



Table 7: Category selection by race in Story
Pop. Animals Cars Women Food Children Men Nature Objects Sports  
Asian 10.7% 18.6% 11.4% 11.4% 8.6% 4.3% 17.1% 11.4% 6.4%  
Hispanic 12.5% 12.5% 25.0% 12.5% 0.0% 12.5% 12.5% 12.5% 0.0%  
White 12.5% 16.8% 13.0% 11.5% 6.3% 4.3% 16.8% 11.1% 7.7%  


