Security Evaluation of Password Generation, Storage, and Autofill in Browser-Based Password Managers

1. Introduction

Passwords remain the dominant method of web authentication despite their well-documented security challenges. Users face a heavy cognitive burden when managing many strong passwords, which leads to password reuse and weak passwords. Password managers promise to alleviate these issues by generating, storing, and autofilling passwords. However, prior research has questioned their security. This paper presents an updated, comprehensive security evaluation of thirteen popular browser-based password managers, examining the full lifecycle: generation, storage, and autofill.

2. Methodology & Scope

We evaluated thirteen password managers, including five browser extensions (e.g., LastPass, Dashlane), six browser-integrated managers (e.g., Chrome, Firefox), and two desktop clients for comparison. The evaluation framework covered three core phases: analyzing the randomness of 147 million generated passwords, assessing storage security (encryption, metadata handling, defaults), and testing autofill vulnerabilities against attacks like clickjacking and XSS.

3. Password Generation Analysis

This section details the first large-scale analysis of password generation algorithms in password managers.

3.1. Randomness Evaluation Framework

We employed statistical tests for randomness, including frequency analysis, entropy calculation, and tests for uniform distribution across the defined character sets (uppercase, lowercase, digits, symbols).
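As a rough illustration of the uniform-distribution test mentioned above, the following sketch computes a Pearson chi-square statistic for character counts pooled over all positions. It is not the study's actual test code: the function name `chi_square_uniform` and the choice to pool positions are assumptions made for illustration. Turning the statistic into a p-value would require a chi-square table or a statistics library, which is left out here.

```python
from collections import Counter


def chi_square_uniform(passwords, alphabet):
    """Return (chi-square statistic, degrees of freedom) for character counts,
    pooled over all positions, against a uniform distribution over `alphabet`.

    Characters that are not in `alphabet` are ignored.
    """
    counts = Counter(ch for pw in passwords for ch in pw)
    total = sum(counts[ch] for ch in alphabet)
    if total == 0:
        raise ValueError("no characters from the alphabet were observed")
    expected = total / len(alphabet)  # uniform expectation per character
    stat = sum((counts[ch] - expected) ** 2 / expected for ch in alphabet)
    return stat, len(alphabet) - 1


# Usage: a larger statistic means a larger deviation from uniformity.
stat, dof = chi_square_uniform(["aa", "ab"], "ab")
print(f"chi-square = {stat:.3f} with {dof} degree(s) of freedom")
```

A per-position variant would apply the same computation to `pw[i]` for each index `i`, which is closer to the positional bias described in the findings.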

3.2. Character Distribution Findings

Several managers exhibited non-random character distributions. For example, some showed bias towards certain character positions or sets, reducing the effective entropy of generated passwords below theoretical expectations.

3.3. Vulnerability to Guessing Attacks

A significant finding was that a subset of generated passwords—particularly those shorter than 10 characters—were vulnerable to online brute-force attacks. Passwords shorter than 18 characters were found to be potentially vulnerable to offline attacks, assuming modern hardware capabilities.

4. Password Storage Security

Replicating and extending prior work by Li et al., we assessed how passwords are encrypted and stored locally and in the cloud.

4.1. Encryption & Key Management

While most managers use strong encryption (e.g., AES-256), key derivation functions and key storage mechanisms varied, with some implementations being weaker than others.
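To make the role of the key derivation function concrete, here is a minimal sketch of deriving a 256-bit vault key from a master password with PBKDF2-HMAC-SHA256 from Python's standard library. The source does not state which KDF or parameters each manager uses, so the algorithm choice, the 16-byte salt, the default iteration count, and the names `derive_vault_key` and `new_salt` are all assumptions for illustration.

```python
import hashlib
import os


def new_salt(length: int = 16) -> bytes:
    """Return a fresh random salt; it is stored alongside the vault, not kept secret."""
    return os.urandom(length)


def derive_vault_key(master_password: str, salt: bytes,
                     iterations: int = 600_000) -> bytes:
    """Derive a 32-byte key suitable for AES-256 from the master password.

    The iteration count is an assumed value: higher counts slow down
    offline guessing at the cost of slower unlocks.
    """
    return hashlib.pbkdf2_hmac(
        "sha256",
        master_password.encode("utf-8"),
        salt,
        iterations,
        dklen=32,
    )
```

A low iteration count or a missing salt is the kind of weakness that can undermine an otherwise strong cipher, because the attacker targets the master password rather than the AES key.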

4.2. Metadata Protection

A critical flaw identified was the storage of sensitive metadata (e.g., website URLs, usernames) in plaintext or with insufficient protection, creating a privacy risk even if the password itself is encrypted.

4.3. Default Configuration Analysis

Several password managers had insecure defaults, such as enabling automatic autofill or not requiring the master password upon browser restart, increasing the attack surface.

5. Autofill Mechanism Vulnerabilities

Autofill, while convenient, introduces significant attack vectors. We tested against known exploit classes.

5.1. Clickjacking & UI Redressing

We found that several managers remained vulnerable to clickjacking attacks, where a malicious site overlays invisible elements over legitimate UI buttons to trick users into triggering autofill on an attacker-controlled field.

5.2. Cross-Site Scripting (XSS) Risks

If a website has an XSS vulnerability, an injected script could potentially interact with the password manager's DOM elements to exfiltrate credentials, a risk highlighted in earlier work by Stock and Johns.

5.3. Network Injection Attacks

Managers that communicate with cloud services for syncing or features were tested for susceptibility to man-in-the-middle attacks that could inject malicious code or steal authentication tokens.

6. Results & Comparative Analysis

Overall, security has improved compared to evaluations from five years prior, but significant issues persist. No single manager was flawless across all three categories (generation, storage, autofill). Browser-integrated managers often had simpler, more secure autofill logic but weaker generation algorithms. Third-party extensions offered more features but introduced greater complexity and attack surface. We identify specific managers that performed poorly and should be avoided by security-conscious users.

Managers Evaluated: 13
Passwords Generated & Analyzed: 147M+

7. Recommendations & Future Directions

For Users: Choose managers with strong security track records, enable all available security features (like 2FA), and be cautious with autofill.
For Developers: Implement cryptographically secure random number generators (CSPRNGs) for password generation, encrypt all metadata, adopt secure defaults (e.g., master password always required), and harden autofill against UI manipulation.
For Researchers: Explore the usability-security trade-off of autofill, develop standardized security evaluation frameworks, and investigate post-quantum cryptography for future-proofing.
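The developer recommendation to use a CSPRNG can be sketched with Python's `secrets` module, which draws from the operating system's secure random source. This is an illustrative sketch, not any manager's actual generator: the symbol set is hypothetical, chosen so that the alphabet has 72 characters like the example set in Section 9, and the default length of 20 is an assumption.

```python
import secrets
import string

# Hypothetical 10-symbol set; real managers define their own.
SYMBOLS = "!@#$%^&*-_"
ALPHABET = (string.ascii_lowercase + string.ascii_uppercase
            + string.digits + SYMBOLS)


def generate_password(length: int = 20) -> str:
    """Pick each character independently and uniformly from ALPHABET."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(generate_password())
```

The sketch deliberately avoids the general-purpose `random` module, which is not designed for security use. It also does not force one character from each class, since such composition rules shrink the space of possible passwords slightly.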

8. Original Analysis & Expert Commentary

Core Insight: The Oesch and Ruoti study delivers a sobering reality check: the very tools designed to solve the password crisis are themselves a patchwork of vulnerabilities. The industry's focus on convenience and feature bloat has, in several cases, directly undermined core security promises. The finding that generated passwords can be weak is particularly damning—it strikes at the heart of the password manager's value proposition.

Logical Flow: The paper structures its analysis along the user journey: creation (generation), at-rest (storage), and in-use (autofill). This lifecycle approach, reminiscent of threat modeling in frameworks like Microsoft's STRIDE, reveals that weaknesses are not isolated but systemic. A flaw in generation reduces the effectiveness of strong storage; a flaw in autofill nullifies both. This interconnectedness is often missed in point-in-time audits.

Strengths & Flaws: The study's strength is its comprehensiveness and replication of past work, providing a rare longitudinal view of security evolution. The massive corpus of 147 million generated passwords for analysis is commendable. However, the analysis has a flaw common to many security evaluations: it's largely a black-box, functional test. It identifies what is broken but provides less insight into why from a software engineering perspective—were these flaws due to rushed deadlines, misunderstood specifications, or a lack of security review? Furthermore, while it references the NIST Digital Identity Guidelines, a deeper dive into how these managers align (or fail to align) with standards like FIPS 140-3 or the security requirements outlined in the IETF's Password Authenticated Key Exchange (PAKE) proposals would have added significant weight.

Actionable Insights: For enterprise security teams, this paper is a mandate to scrutinize approved password managers rigorously. Relying on brand reputation is insufficient. Procurement checklists must include specific tests for generation randomness (e.g., using standardized test suites like Dieharder or NIST's STS), metadata encryption, and autofill behavior under attack simulations. For developers, the lesson is to prioritize simplicity and secure defaults. The most secure autofill mechanism might be the simplest: a manual "click-to-fill" that requires explicit, conscious user action, as suggested by research from the University of California, Berkeley on explicit consent interfaces. The future lies not in trying to make intelligent, automatic filling perfectly secure, but in designing minimally intrusive yet maximally explicit user interactions that keep the human in the loop for critical security decisions.

9. Technical Details & Mathematical Framework

The evaluation of password generation randomness relied on calculating the Shannon entropy $H$ of the generated passwords:

$H = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$

where $P(x_i)$ is the probability of character $x_i$ appearing in a given position and $n$ is the number of distinct characters in the set. For a truly random selection from a set of $C$ characters, the maximum entropy per character is $\log_2(C)$. For a 72-character set (26 lowercase + 26 uppercase + 10 digits + 10 symbols), max $H_{char} \approx 6.17$ bits. A 10-character password thus has a theoretical maximum of ~61.7 bits of entropy.
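The entropy formula can be estimated empirically from a sample of generated passwords. The sketch below computes the Shannon entropy of the characters observed at one position and compares it with the theoretical maximum $\log_2(C)$. The function name `positional_entropy` is hypothetical, and the study's own tooling may differ.

```python
import math
from collections import Counter


def positional_entropy(passwords, position):
    """Shannon entropy (in bits) of the characters seen at `position`.

    Passwords shorter than `position + 1` are skipped.
    """
    counts = Counter(pw[position] for pw in passwords if len(pw) > position)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no password is long enough for this position")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Theoretical maximum per character for a 72-character set.
max_per_char = math.log2(72)

sample = ["ab", "bb", "cb", "db"]
print(positional_entropy(sample, 0), positional_entropy(sample, 1), max_per_char)
```

An estimate that stays noticeably below `max_per_char` on a large sample suggests positional bias. On small samples the estimate is biased downward, so the sample should be much larger than the alphabet.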

The study found that biases in some managers' algorithms reduced the effective entropy. The vulnerability to offline attacks was assessed using an estimated cracking rate $R$ (hashes per second) and the password space $N$. On average an exhaustive search succeeds after covering half the space, hence the factor of 2:

$\text{Time to crack} \approx \frac{N}{2 \times R}$

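Assuming a high-end rate of $R = 10^{10}$ hashes/sec (within range for modern GPU clusters), a password at the ~65-bit boundary ($N = 2^{65}$) has an expected cracking time of about $1.8 \times 10^{9}$ seconds, roughly 58 years on one such system, and each bit of entropy lost halves that figure. Passwords below this threshold are therefore within reach of a motivated attacker who can add hardware or wait.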
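The time-to-crack estimate is simple arithmetic and can be explored for other entropy levels and rates. The helper below is a direct transcription of the formula above; the function name and the year-length constant are illustrative assumptions.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_crack_seconds(entropy_bits: float, rate: float) -> float:
    """Expected exhaustive-search time: N / (2 * R) with N = 2 ** entropy_bits."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 2 ** entropy_bits / (2 * rate)


# Compare a few entropy levels at the assumed rate of 1e10 hashes/sec.
for bits in (50, 61.7, 65, 80):
    years = expected_crack_seconds(bits, 1e10) / SECONDS_PER_YEAR
    print(f"{bits:>5} bits: {years:.3g} years")
```

Because the time doubles with every added bit, small entropy losses from generator bias matter most for passwords that are already near the boundary.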

10. Experimental Results & Data Visualization

Key Chart 1: Character Distribution Bias. A bar chart comparing the observed vs. expected frequency of character types (uppercase, lowercase, digit, symbol) across multiple password managers. Several managers showed statistically significant deviation (p < 0.01) from the expected uniform distribution, with an over-representation of digits in certain positions.

Key Chart 2: Entropy vs. Password Length. A scatter plot showing the measured entropy per manager for different configured password lengths (8, 12, 16, 20 characters). The plot would reveal that while most managers approach the theoretical entropy line for longer passwords, several fall short for shorter lengths (8-12 chars), clustering below the line, indicating weaker randomness.

Key Chart 3: Autofill Vulnerability Matrix. A heatmap with managers on the Y-axis and vulnerability classes (Clickjacking, XSS-leakage, Network Injection) on the X-axis. Cells are colored green (not vulnerable), yellow (partially/variably vulnerable), and red (vulnerable). This visualization clearly shows which managers are riskiest across the autofill attack surface.

11. Analysis Framework: Case Study Example

Case: Evaluating "Manager X's" Autofill Security.

Step 1 - Feature Mapping: Document how Manager X triggers autofill: Does it auto-populate? Does it show a dropdown? What DOM attributes does it rely on (id, name, class, placeholder)?

Step 2 - Threat Modeling: Apply the STRIDE model.

Spoofing: Can a fake login form trick the manager? (Test with variations of `id="password"`).
Tampering: Can JavaScript modify the filled-in data before submission?
Repudiation: Does the manager log autofill events?
Information Disclosure: Can a hidden iframe or crafted CSS (opacity:0.001) cause filling into an invisible field that is then exfiltrated?
Denial of Service: Can malicious sites lock the autofill feature?
Elevation of Privilege: Does autofill work on browser chrome pages? (Should not).

Step 3 - Test Execution: Create a test harness webpage that systematically attempts each threat vector. For clickjacking, create overlapping transparent elements. For XSS, simulate a script reading the `value` property of filled fields.
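One possible shape for such a harness, sketched under stated assumptions: a small Python function that emits a test page containing a near-invisible login form, plus a script that reports only whether the password field was filled (never the value itself). The function name `build_harness`, the element ids, and the decoy button are hypothetical; the page is meant to be served from a test origin the evaluator controls.

```python
def build_harness(opacity: float = 0.001, field_id: str = "password") -> str:
    """Return an HTML test page with a near-invisible login form."""
    # Restrict ids to simple tokens so they are safe in both HTML and JS.
    if not field_id.replace("-", "").replace("_", "").isalnum():
        raise ValueError("field_id must be alphanumeric, '-' or '_'")
    return f"""<!DOCTYPE html>
<html>
<body>
  <button id="decoy">Play video</button>
  <form id="hidden-login"
        style="opacity:{opacity}; position:absolute; top:0; left:0;">
    <input type="text" id="username" name="username">
    <input type="password" id="{field_id}" name="{field_id}">
  </form>
  <pre id="report"></pre>
  <script>
    // Poll the field and record only whether it was filled.
    setInterval(function () {{
      var filled = document.getElementById("{field_id}").value.length > 0;
      document.getElementById("report").textContent = "filled: " + filled;
    }}, 1000);
  </script>
</body>
</html>"""


with open("harness.html", "w", encoding="utf-8") as fh:
    fh.write(build_harness())
```

Varying `opacity` and `field_id` across runs covers the Information Disclosure and Spoofing checks from Step 2 with one template.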

Step 4 - Analysis & Scoring: Rate each vulnerability based on likelihood and impact (e.g., using DREAD scoring). Aggregate score determines overall autofill security rating for Manager X.
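The scoring step can be sketched as follows. DREAD rates each finding on five factors (Damage, Reproducibility, Exploitability, Affected users, Discoverability). The 0 to 10 scale, the use of the mean per finding, and taking the worst finding as the overall rating are assumptions for this sketch, and the example ratings are made-up illustrative values, not results from the study.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Dread:
    damage: int
    reproducibility: int
    exploitability: int
    affected_users: int
    discoverability: int

    def score(self) -> float:
        """Mean of the five factors, each expected in the range 0..10."""
        parts = (self.damage, self.reproducibility, self.exploitability,
                 self.affected_users, self.discoverability)
        if any(not 0 <= p <= 10 for p in parts):
            raise ValueError("each DREAD factor must be between 0 and 10")
        return sum(parts) / len(parts)


def overall_rating(findings: Mapping[str, Dread]) -> float:
    """Overall risk for a manager: the score of its worst finding."""
    return max((f.score() for f in findings.values()), default=0.0)


# Illustrative values only.
manager_x = {
    "clickjacking": Dread(8, 10, 7, 9, 6),
    "xss_value_read": Dread(2, 2, 2, 2, 2),
}
print(overall_rating(manager_x))
```

Using the maximum rather than an average keeps one severe flaw from being diluted by several minor ones; a weighted sum is an equally defensible alternative.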

This structured approach moves beyond ad-hoc testing and ensures comprehensive coverage.

12. Future Applications & Research Directions

1. Integration with WebAuthn/Passkeys: The future is passwordless. The next evolution for password managers is to become primary brokers for passkeys (based on the W3C Web Authentication API). Research is needed on secure syncing and recovery of passkey private keys across devices, a challenge highlighted by the FIDO Alliance.

2. Context-Aware, Risk-Based Autofill: Instead of binary fill/don't-fill logic, future managers could use machine learning to assess page legitimacy (checking domain age, SSL cert, reputation scores) and user context (typical login time, device) to adjust autofill behavior, requiring additional authentication for high-risk scenarios.

3. Formal Verification & Secure Hardware: Critical components, especially the random number generator and the core encryption/decryption routines, could be formally verified using tools like Coq or Tamarin Prover. Integration with Trusted Platform Modules (TPMs) or Secure Enclaves for key storage could elevate security for high-value targets.

4. Decentralized & User-Centric Architectures: Moving away from centralized cloud vaults to decentralized protocols (e.g., based on secure multi-party computation or personal servers) could mitigate risks of large-scale provider breaches. This aligns with the broader "Solid" project vision for personal data pods.

13. References

Oesch, S., & Ruoti, S. (2020). That Was Then, This Is Now: A Security Evaluation of Password Generation, Storage, and Autofill in Browser-Based Password Managers. USENIX Security Symposium.
Li, Z., He, W., Akhawe, D., & Song, D. (2014). The Emperor’s New Password Manager: Security Analysis of Web-based Password Managers. USENIX Security Symposium.
Stock, B., & Johns, M. (2014). Protecting Users Against XSS-based Password Manager Abuse. ACM ASIA CCS.
FIDO Alliance. (2022). FIDO2: WebAuthn & CTAP Specifications. https://fidoalliance.org/fido2/
Grassi, P., et al. (2017). NIST Special Publication 800-63B: Digital Identity Guidelines - Authentication and Lifecycle Management.
Silver, D., Jana, S., Boneh, D., Chen, E., & Jackson, C. (2014). Password Managers: Attacks and Defenses. USENIX Security Symposium.
Shannon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal.