OpenAI says it’s “impossible” to create useful AI models without copyrighted material

sculd@beehaw.org · 10 months ago

bedrooms@kbin.social · edit-2 10 months ago

Alas, AI critics jumped to conclusions this time. Read this:

Further, OpenAI writes that limiting training data to public domain books and drawings “created more than a century ago” would not provide AI systems that “meet the needs of today’s citizens.”

It’s a plain fact. It does not say we have to train AI without paying.

To give you some context, virtually everything on the web is copyrighted, from reddit comments to blog articles to open source software. Even open data usually comes with a copyright notice. So do open research articles.

If misled politicians write a law banning the use of copyrighted materials, that’ll kill all AI development in democratic countries. What will happen is that AI development will be led by dictatorships, and that’s absolutely a disaster even for the critics. Think about it. Do we really want Xi, Putin, Netanyahu and Bin Salman to control all the next-gen AIs powering their cyber warfare while the West has to fight them with Siri and Alexa?

So, I agree that, at the end of the day, we’d have to ask how much rule-abiding AI companies should pay for copyrighted materials, and that’d be less than the copyright holders would want. (And I think it’s sad.)

However, you can’t equate these particular statements in this article to a declaration of fuck-copyright. Tbh Ars Technica disappointed me this time.

P03 Locke@lemmy.dbzer0.com · 10 months ago

It’s bizarre. People suddenly start voicing pro-copyright arguments just to kill a useful technology, when we should be trying to burn copyright to the fucking ground. Copyright is a tool for the rich and it will remain so until it is dismantled.

AVincentInSpace@pawb.social · edit-2 10 months ago

Life plus 70 years is bullshit.

20 years from release date is not.

No one except corporate bigwigs will say they should be allowed to do so in perpetuity, but artists still need legal protections to make money off of what they create, and Midjourney (making boatloads of money off of making automated collages from artwork it obtained not only without compensation but without attribution) is a prime example of why.

AVincentInSpace@pawb.social · 10 months ago

“But you see, we have to let corporations break the law, because if we don’t, a country we might be at war with later will”

krellor@beehaw.org · edit-2 10 months ago

The issue is that fair use is more nuanced than people think, and the barrier to claiming fair use is higher when you are engaged in commercial activities. I’d more readily accept the fair use arguments from research institutions, companies that train and release their model weights (Llama), or some other activity with a clear tie to the public benefit.

OpenAI isn’t doing this work for the public benefit, regardless of the language of altruism they wrap it in. They, and Microsoft, are hoovering up others’ data to build a for-profit product and make money. That’s really what it boils down to for me. And I’m fine with them making money. But pay the people whose data you’re using.

Now, in the US there is no case law on this yet and it will take years to settle. But personally, philosophically, I don’t see how Microsoft taking NYT articles and turning them into a paid product is any different than Microsoft taking an open source project that doesn’t allow commercial use and sneaking it into a project.

bedrooms@kbin.social · 10 months ago

Well, regarding text online, most of it is there for visitors to read for free. So, if we end up treating AI training like a human reading text, one could argue they don’t have to pay.

Reddit doesn’t pay their users, anyway.

But personally, philosophically, I don’t see how Microsoft taking NYT articles and turning them into a paid product is any different than Microsoft taking an open source project that doesn’t allow commercial use and sneaking it into a project.

Agreed. That said, NYT actually intentionally allows Google and Bing servers to parse their news articles in order to put their articles at the top of the search results. In that regard they might like certain forms of processing by LLMs.

krellor@beehaw.org · 10 months ago

I thought about the indexing situation in contrast to the user paywall. Without thinking too much about any legal argument, it would seem that NYT having a paywall for visitors is them enforcing their right to the content, signaling that it isn’t free for all use, while allowing search indexers access makes the content visible but not free on the market.

It reminds me of the Canadian claim that Google should pay Canadian publishers for the right to index, which I tend to disagree with. I don’t think Google or Bing should owe NYT money for indexing, but I don’t think allowing indexing confers the right to commercial use beyond indexing. I highly suspect OpenAI spoofed search indexers while crawling content specifically to bypass paywalls and the like.

I think part of what the courts will have to weigh for the fair use arguments is the extent to which NYT is harmed by the use, the extent to which the content is transformed, and the public interest between the two.

I find it interesting that OpenAI or Microsoft already pay AP for use of their content because it is used to ensure accurate answers are given to users. I struggle to see how the situation is different with NYT in OpenAI’s opinion, other than perhaps on price.

It will be interesting to see what shakes out in the courts. I’m also interested in the proposed EU rules which recognize fair use for research and education, but less so for commercial use.

Thanks for the reply! Have a great day!