CVE-2024-5206: Sensitive Data Leakage in sklearn.feature_extraction.text.TfidfVectorizer in scikit-learn/scikit-learn
A sensitive data leakage vulnerability was identified in scikit-learn's TfidfVectorizer in versions up to and including 1.4.1.post1; it was fixed in version 1.5.0. The vulnerability arises because the stop_words_ attribute unexpectedly stores all tokens present in the training data, rather than only the subset of tokens required for the TF-IDF technique to function. As a result, stop_words_ can contain tokens that were meant to be discarded and not stored, such as passwords or keys, and anyone who can read the fitted vectorizer can recover them. The impact of this vulnerability varies with the nature of the data processed by the vectorizer.
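To make the description concrete, the following is a minimal sketch of how one might inspect what a fitted vectorizer retains. The helper names discarded_tokens and demo, the sample corpus, and the token "hunter2" are hypothetical and chosen only for illustration; demo() requires scikit-learn to be installed.

```python
def discarded_tokens(vectorizer):
    """Return the tokens a fitted vectorizer keeps in stop_words_, if any."""
    # getattr with a default also covers objects that lack the attribute.
    stored = getattr(vectorizer, "stop_words_", None)
    return set(stored) if stored else set()


def demo():
    """Fit a small vectorizer and print what it retained (needs scikit-learn)."""
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical corpus: "hunter2" occurs in one document only,
    # so min_df=2 drops it from the vocabulary.
    corpus = [
        "reset password hunter2",
        "reset password request",
        "password request received",
    ]
    vectorizer = TfidfVectorizer(min_df=2).fit(corpus)
    print("vocabulary:", sorted(vectorizer.vocabulary_))
    print("discarded but stored:", sorted(discarded_tokens(vectorizer)))
```

On an affected release, calling demo() would be expected to list "hunter2" among the discarded-but-stored tokens even though it is absent from the vocabulary. On a fixed release the second list should be empty.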
Other sources
scikit-learn could allow a remote authenticated attacker to obtain sensitive information, caused by an unexpected storage of all tokens present in the training data within the stop_words_ attribute. By sending a specially crafted request, an attacker could exploit this vulnerability to obtain passwords or keys, and use this information to launch further attacks against the affected system.
— IBM
Frequently Asked Questions
What is the severity of CVE-2024-5206?
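No severity score is given here. CVE-2024-5206 is classified as a sensitive data leakage vulnerability, and its practical impact depends on how sensitive the data processed by the vectorizer is.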
How do I fix CVE-2024-5206?
To fix CVE-2024-5206, upgrade scikit-learn to version 1.5.0 or later.
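Upgrading the library does not by itself change vectorizers that were fitted and saved under an affected release, so a saved model may still carry the stored tokens. The sketch below shows one way to clear them, assuming (as the scikit-learn documentation describes) that stop_words_ is used only for inspection and can be removed safely. The helper name scrub_stop_words is hypothetical.

```python
def scrub_stop_words(vectorizer):
    """Delete stop_words_ from a fitted vectorizer.

    Returns True if the attribute was present and has been removed,
    False if there was nothing to remove.
    """
    if hasattr(vectorizer, "stop_words_"):
        delattr(vectorizer, "stop_words_")
        return True
    return False
```

A typical use would be to load each previously saved vectorizer, call scrub_stop_words on it, and save it again. Retraining under the fixed release is the other option.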
Which versions of scikit-learn are affected by CVE-2024-5206?
CVE-2024-5206 affects versions of scikit-learn up to and including 1.4.1.post1.
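As a rough illustration, an installed version can be checked against the fixed release as follows. This sketch compares only the major and minor numbers and conservatively treats every release below 1.5 as needing an upgrade, which is an assumption for any release the advisory does not name. The names FIXED and needs_upgrade are hypothetical.

```python
import re

# First fixed release line according to the advisory (1.5.0).
FIXED = (1, 5)


def needs_upgrade(version):
    """Return True if a scikit-learn version string predates the fix."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    # Tuple comparison handles cases such as 1.10 > 1.5 correctly.
    return major_minor < FIXED
```

In practice this would be called as needs_upgrade(sklearn.__version__) after importing sklearn.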
What products are impacted by CVE-2024-5206?
In addition to scikit-learn itself, CVE-2024-5206 impacts specific versions of IBM Cloud Pak for Security and IBM QRadar Suite Software.
Is CVE-2024-5206 a known issue in libraries?
Yes. CVE-2024-5206 is a known issue in the scikit-learn library: TfidfVectorizer stored discarded training tokens in its stop_words_ attribute.