Whistleblower: ChatGPT training practices violated copyright laws
A former artificial intelligence developer who worked on ChatGPT at OpenAI said he believed copyright laws were violated during the training process. Suchir Balaji, 26, quit OpenAI in August after nearly four years at the company. He was being considered as a witness in litigation against the company and has since been found dead amid a spate of controversy.
Balaji was well regarded by colleagues. John Schulman, co-founder of the San Francisco-based OpenAI, said on social media that Balaji was one of the company's strongest and most essential contributors. Schulman, who left the company on the same day, spoke highly of him, mentioning Balaji's desire to pursue a PhD in artificial intelligence because, unlike many of his colleagues, he did not believe breakthroughs in general intelligence were near.
“Suchir’s contributions to this project were essential, and it wouldn’t have succeeded without him,” said OpenAI co-founder John Schulman in a LinkedIn post by Balaji’s father. Schulman, who recruited Balaji, elaborated on his skills saying he was an exceptional engineer and scientist while also highlighting his attention to detail and ability to notice subtle bugs or logical errors. “He had a knack for finding simple solutions and writing elegant code that worked,” Schulman wrote. “He’d think through the details of things carefully and rigorously.”
After leaving the company in August 2024 and becoming a whistleblower, the University of California, Berkeley graduate was found dead in his San Francisco apartment on November 26, 2024. No foul play has been reported and the death has been ruled a suicide, but many have refused to accept the chief medical examiner's findings, citing what they see as mysterious circumstances.
The Guardian provided more details of Balaji’s back story in an article explaining the events. Balaji grew up in the San Francisco Bay Area and first arrived at the fledgling AI research lab for a 2018 summer internship while studying computer science at UC Berkeley. He returned a few years later to work at OpenAI, where one of his first projects, called WebGPT, helped pave the way for ChatGPT.
Balaji later shifted to organizing the huge datasets of online writings and other media used to train GPT-4, the fourth generation of OpenAI’s flagship large language model and a basis for the company’s famous chatbot. It was that work that eventually caused Balaji to question the technology he helped build, especially after newspapers, novelists and others began suing OpenAI and other AI companies for copyright infringement.
Suchir Balaji was listed in a court filing as having "relevant documents" about copyright violations. He told the Associated Press he would "try to testify" in the strongest copyright infringement cases and considered a lawsuit brought by the New York Times last year to be the "most serious." The Times's lawyers named him in a November 18 court filing as someone who might have "unique and relevant documents" supporting allegations of OpenAI's willful copyright infringement. His records were also sought by lawyers in a separate case brought by book authors, including the comedian Sarah Silverman, according to a court filing.
“It doesn’t feel right to be training on people’s data and then competing with them in the marketplace,” Balaji told the AP in late October. “I don’t think you should be able to do that. I don’t think you are able to do that legally.”
He told the AP that he had grown gradually more disillusioned with OpenAI, especially after the internal turmoil that led its board of directors to fire and then rehire the CEO, Sam Altman, last year. Balaji said he was broadly concerned about how its commercial products were rolling out, including their propensity for spouting false information, known as hallucinations. But of the "bag of issues" he was concerned about, he said, he was focusing on copyright as the one it was "actually possible to do something about". He also acknowledged that his was an unpopular opinion within the AI research community, which is accustomed to pulling data from the internet, but said "they will have to change and it's a matter of time."
Balaji's parents have also questioned the ruling of no foul play and hired a private investigator. They ordered an independent autopsy to further investigate the death, and Balaji's mother has since taken to social media claiming, "Private autopsy doesn't confirm the cause of death stated by police."
Klemchuk Intellectual Property Attorneys provides further legal insight on the case. In addition to reporting evidence of a "head injury" in the private autopsy they ordered, the parents claim Balaji's apartment was "ransacked," that there were signs of a struggle, and blood, in his bathroom, and that no suicide note was found. More recently, Balaji's parents have requested an FBI investigation into the matter, stating that the local authorities lack the ability to thoroughly investigate a case that involves cybersecurity and whistleblower protection. San Francisco police report that the case remains "open and active."
Balaji was listed as a "person with knowledge" in the Authors Guild and New York Times lawsuits pending against OpenAI and had expressed that he intended to testify. Since first requesting FBI involvement, his parents have reported that Balaji was gathering evidence and preparing to "go public in a big way," including potentially bringing his own legal action against OpenAI.
The family also reported unusual activity on Balaji's devices after his death, including temporary files from Google Chrome and Google Drive dated November 29, three days after his death. They have further noted that a device reportedly containing "sensitive information" related to the lawsuits pending against OpenAI is missing from the apartment.
Given Balaji's unavailability, counsel for OpenAI is certain to argue for the exclusion of his statements as hearsay: out-of-court statements, offered for the truth of the matter asserted, whose speaker is not available for cross-examination. The plaintiffs' attorneys could potentially invoke the present sense impression exception, asserting that the interview and blog post came shortly after Balaji's discovery of OpenAI's alleged infringement and his resignation from the company for that reason.
Another possibility is arguing for admission of the evidence under the residual hearsay exception of Federal Rule of Evidence 807, which provides that even if the information is hearsay, it is not excluded if "the statement is supported by sufficient guarantees of trustworthiness—after considering the totality of the circumstances under which it was made and the evidence, if any, corroborating the statement; and it is more probative on the point for which it is offered than any other evidence that the proponent can obtain through reasonable efforts." Given Balaji's insider status at the time of the alleged infringement, there is no doubt his information is relevant and highly probative to the cases. Whether the plaintiffs' attorneys can satisfy the other requirements of the residual hearsay exception, however, will depend on the totality of the evidence.