
AI firm with ties to U.S. government exposes billions of documents in breach.

New research from data security firm UpGuard shows that a U.S. government AI contractor’s massive database of sensitive documents was exposed on the Internet until the end of last month.



In a post on its blog, UpGuard breaks down how Veritone AI exposed 550GB of internal and client data including audio, video and biometric image media, employee PII, police body camera footage, FOIA requests and related documents, employee credentials, system logs with authorization tokens, and more.


The exposed centralized dataset contained sensitive information about Veritone resources and users, including employees’ full names, usernames, and email addresses. But exposure of government personnel data was of particular concern. “Internal credentials also appear in the exposed logs, such as application tokens and, in some cases, plain text passwords. The unauthorized use of these credentials would grant a threat actor whatever level of access the exposed accounts held, possibly exposing additional sensitive data to a malicious third party.”


At least some of the exposed personal data was being used to train AI systems, which has some observers asking whether machine learning companies touting their security bona fides are in fact creating a mother lode of vulnerable data honeypots.


“What we have become accustomed to call ‘artificial intelligence’ relies on concatenating pieces of an enormous dataset with a complex algorithm and detailed data tagging,” says UpGuard. “Because AI technologies often require massive databases full of whatever information they are analyzing, both the likelihood and impact of a data exposure rapidly increase.”


It notes that “a significant portion of the services Veritone provides for government and police agencies involves automatically redacting sensitive information from documents, analyzing facial recognition data (referred to as identifying suspects), and processing audio and video surveillance data to find insights, keywords, and types of images.”


It also points out that Veritone provides AI services for a wide array of industries, including law, energy, and entertainment – meaning the potential for data breaches is everywhere.


UpGuard discovered Veritone’s first exposed Elasticsearch server hosted on the Microsoft Azure Government Cloud on March 23. It contained 464 million documents. The next day, the second server was discovered, containing 1.2 billion documents. According to the blog, “these servers did not require or ask for any credentials but rather provided anonymous access to anyone on the internet.”


After being made aware of the breach, Veritone secured the Elasticsearch servers on March 30. The data is no longer publicly available.


In this case, the fault does not lie with Elasticsearch.


The software, an open source search and analytics engine designed to quickly search large datasets, can be configured to require authentication. However, Veritone’s servers were not configured that way – an oversight that undercut other security measures and left the government data exposed.
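
For administrators who want to confirm that their own server is not in the same state, the check can be sketched in a few lines of Python. This is a minimal illustration, not UpGuard's method: the function names and the example host are hypothetical, and it assumes that a server with authentication enabled rejects a request carrying no credentials with an HTTP 401 or 403, while an open server answers with a 2xx status. It should only be run against servers you are authorized to test.

```python
import urllib.error
import urllib.request


def classify_status(status: int) -> str:
    """Map the HTTP status of a credential-free request to a rough verdict."""
    if status in (401, 403):
        return "auth-required"      # server refused the anonymous request
    if 200 <= status < 300:
        return "anonymous-access"   # server answered without credentials
    return "inconclusive"           # redirects, server errors, etc.


def probe(base_url: str, timeout: float = 5.0) -> str:
    """Send one GET with no credentials and classify the response."""
    request = urllib.request.Request(base_url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx; the status is in err.code
        return classify_status(err.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"


if __name__ == "__main__":
    # Hypothetical host, used for illustration only
    print(probe("https://search.example.internal:9200"))
```

A result of "anonymous-access" would mean the server answers anyone who can reach it, which is the condition UpGuard describes. A result of "auth-required" only shows that this one request was refused; it is not a full security review.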


Elasticsearch has been transparent about the necessity of configuring the software for authentication. A blog post from 2020 outlines simple steps users can take to secure their data from breaches.


In an interview with Axios, UpGuard VP of Cyber Research Greg Pollock says Microsoft is likely also off the hook. “Microsoft is providing the government cloud as a service; they’re probably not involved in the administration of this database,” Pollock says.


If the responsibility lies with Veritone in its failure to properly configure the Elasticsearch servers – as UpGuard’s assessment clearly implies in stating that “operational tasks such as spinning up an Elastic server should have controls in place to ensure that the server is not publicly accessible” – it is not the first AI firm to mishandle data.


Still, given the volume and sensitivity of Veritone’s information, the breach could have significant implications for how AI training databases are collected, stored and secured.

