Xiao Xiao, reporting from Aofeisi

Qubits | Official account QbitAI

From now on, every word you say publicly online may be taken by Google to train AI!

That's right: after artwork, written works are now also being fed to large models.

Whether it's a tech blog, code, a paper, or any post you make public online, it can be thrown into the "Google large-model blender," even if it is copyrighted.

Google AI is eating everything! Privacy policy updated to allow crawling all public content for AI training

Just this week, Google updated its privacy policy to make it clear that it reserves the right to scrape all publicly available content on the web to build its AI tools.

Netizens immediately erupted. One warned that "Google is scraping everything":

Once Google can read what you write, it becomes their "property."

Other netizens were even more pessimistic:

Soon, all content producers will be AI.

So, what's going on with this privacy policy?

Used to train AI products such as Bard

The story starts with the privacy policy update Google made in the past few days.

In its latest privacy policy, Google added a clause on AI models for "research and development":

Google uses the information to improve our services and develop new products, features and technologies for the benefit of our users and the public.

For example, we use publicly available information to help train Google's AI models and build useful products and features (such as Google Translate, Bard, and Cloud AI features).

In other words, any publicly available information Google can collect may be used to train AI-related products or features such as Google Translate, Bard, and Cloud AI.

So, what exactly does this publicly available information include?

Examples include internet, web and other activity information, including search terms, information about apps and browsers' interactions with Google services, and the use of Google services on third-party websites and apps.

In other words, not only blogs and other previously published content, but also Google Docs that are publicly shared on the Internet and posts containing personal information may be collected by Google for large-model training.

Of course, for now this content is still limited to "public information".

Email services such as Google's Gmail should not yet be crawled for this data.

Moreover, Google also explicitly states in its privacy policy that it may use this personal or public information for other purposes, such as addressing security threats, content review, service maintenance, personalized advertising, or legal compliance.

But why is Google updating this policy at this juncture?

"AI is challenging text copyright"

It may be related to the recent "throttling" moves by companies such as Reddit and Twitter.

First, in April, Reddit announced that it would start charging companies for access to the API.

Reddit's CEO considers the site's data valuable and does not want to hand that content to tech companies for free.

Subsequently, Twitter also began rate-limiting on the grounds that it "doesn't want AI companies freeloading on its data": unverified users can view only 600 posts per day, rising to 6,000 after verification.

These policies have had a serious impact on users and third-party tools. Reddit's move sparked a large-scale protest, with many moderators shutting down their own forums in response. Many Twitter users were affected as well, and some netizens even said that "Twitter has been killed."

In any case, AI freeloading on data has become a conflict that can no longer be ignored.

Regarding Google's AI crawling data, some netizens were puzzled:

Search engines and the like have crawled data for years, so why are people so resistant to "AI scraping"?

Some netizens responded:

It's essentially a matter of copyright. If you just quote copyrighted material, it is not necessarily copyright infringement. But if AI is used to "blend and launder" copyrighted content, and this is legalized, then copyright is essentially dead.

That is precisely why this netizen is pessimistic about the matter:

Would you accept someone copying your blog without attributing the source, taking your open source code for a paid service, or using your answers on StackOverflow to answer other people's questions?

Everything I did before was free. But now if AI wants me to disappear, then I will disappear.

Of course, some netizens have accepted the policy, while noting that everyone still needs to stay vigilant:

Read the new policy carefully and notice how much information we leak online.

So, what do you think about this?

Reference Links:

[1] https://gizmodo.com/google-says-itll-scrape-everything-you-post-online-for-1850601486

[2] https://news.ycombinator.com/item?id=36577626
