Third-party data scraping ruled legal, and X's API business may fall through

Author: Three Easy Life

From Gemini admitting that it used Wenxin Yiyan to train on Chinese corpus, to the prediction that the high-quality datasets held by major institutions are about to dry up, the possibility that large models will run out of data within the next few years has become a "sword of Damocles" hanging over the entire AI industry. To collect more data and corpus to feed their large models, vendors have turned to buying it outright: Google buys data from Reddit for $60 million a year, and OpenAI is approaching news publishers around the world to sign content licensing agreements.

For a while, selling data to AI vendors became a good way for content platforms to make money. X, for example, which has suffered from the departure of a large number of advertisers, put its API behind a paywall last year and charged more than one million dollars to each enterprise-level customer that needed X's user data.

However, just a year later, X's business of selling data to third parties through its API may be about to collapse. A few days ago, in X's lawsuit accusing Israeli data company Bright Data of illegally scraping millions of records from the platform, a federal court in California rejected all of X's claims.

In August last year, X claimed that Bright Data had blatantly violated the platform's terms of service and used technical means to evade its risk controls, scraping data such as replies, likes, and retweets from X in bulk. X argued that this scraping placed a serious load on its servers and degraded the user experience, and it requested injunctive relief to stop Bright Data's conduct.

In response, Bright Data said that X has built a wall to deny others access to the platform's public data, and that it would defend its position in court to ensure that everyone has open access to the internet and its public data.

Using crawlers to collect data from the internet has been a gray-area practice for the past two decades. Most companies do it quietly and avoid drawing attention, and few vendors are as willing as Bright Data to admit openly that they do so. What is even more surprising is that the court did not side with X as the injured party. For this reason, some believe this U.S. federal court ruling could significantly reshape the internet industry.

The court's reasoning in rejecting X's claims was that the social network does not actually own its users' data: a platform cannot enjoy the protection of the safe harbor principle on the one hand and insist that the data belongs to it on the other. This amounts to denying, as a matter of law, a social platform's sovereignty over user data. Since X does not own the data and merely makes it publicly accessible, Bright Data's scraping of that public data is not illegal.

In a sense, the safe harbor principle that once shielded many U.S. internet platforms from legal trouble has now become an obstacle to their selling data. The "safe harbor principle" is a concept from the Digital Millennium Copyright Act, enacted by the United States in 1998, which was designed to address copyright protection in the internet context.

Specifically, after receiving a notice from the rights holder, the network service provider must promptly forward the notice to the user and, based on the preliminary evidence and the type of service, take necessary measures such as deleting, blocking, or disconnecting links to the infringing material. As long as the provider fulfills these obligations, it enters the "safe harbor" and does not bear liability for the infringement. "It is impossible for us to monitor everything that happens on the platform in real time" is a line internet platforms commonly use to avoid regulatory responsibility.

Internet companies spent their formative years at the start of the new century under the protection of the safe harbor principle. But as the internet economy prospered and the start-ups grew into giants, the same principle that let them avoid regulatory responsibility has in turn cost them the legal right to claim ownership of user data.

Under the safe harbor principle, when a user publishes infringing content on a platform, the rights holder can notify the platform to delete it. As long as the platform acts promptly, the copyright owner cannot sue the platform and can only sue the user who posted the infringing content. So if the platform maintains that content posted by users is not the platform's own conduct, on what legal basis does the platform own the users' data? Turning the platform's own shield against it in this way is the key reason X failed to obtain injunctive relief this time.

Nor is this an isolated case: Bright Data not only won against X, but a U.S. court also rejected similar claims from Meta earlier this year. Two similar rulings within half a year show that the wind has indeed changed for internet platforms. X and Meta now have to choose between the safe harbor principle and ownership of user data, and in practice internet companies have only one option, which is to keep the safe harbor. Even though the principle is becoming harder and harder to rely on, it still exempts internet companies from most of their regulatory responsibilities.

In other words, anyone may be able to scrape data from American social platforms in the future. The business of internet companies selling data to AI vendors is likely to end almost as soon as it began. After all, buying data costs real money, while using technical means to bypass the target's barriers is obviously much cheaper. Internet companies rarely lack technical strength, however, so after the big data era, crawling and anti-crawling may once again become a major concern for them.

For users, though, this U.S. federal court precedent may not be good news: at the very least, the experience of using these platforms is likely to get worse. Generally speaking, the anti-crawler strategy of internet companies revolves around determining whether a visitor is human, and the most effective tools for that are not JavaScript parameter encryption or code obfuscation but CAPTCHAs and human verification. As a result, all kinds of bizarre CAPTCHAs may reappear, and everyone may once again find themselves in a battle of wits with strange verification challenges.
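
To make the "is this visitor human?" decision more concrete, here is a minimal Python sketch of one way a platform might decide when to send a visitor to a CAPTCHA. It is not how X or any other platform is known to work. The class name `ChallengeGate`, its parameters, and the sliding-window rate heuristic are all assumptions introduced for illustration. Real systems likely combine many more signals.

```python
from collections import defaultdict, deque


class ChallengeGate:
    """Hypothetical gate: decide when a client must pass human verification."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 10.0):
        self.max_requests = max_requests      # assumed threshold, not a real default
        self.window_seconds = window_seconds  # assumed sliding-window length
        self._hits = defaultdict(deque)       # client id -> recent request timestamps
        self._verified = set()                # clients that already passed a challenge

    def needs_challenge(self, client_id: str, now: float) -> bool:
        """Record one request and report whether to show a CAPTCHA."""
        if client_id in self._verified:
            return False
        hits = self._hits[client_id]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests

    def mark_verified(self, client_id: str) -> None:
        """Call after the client solves the challenge (solving is out of scope).

        Simplification: a verified client is never challenged again.
        """
        self._verified.add(client_id)
        self._hits.pop(client_id, None)


if __name__ == "__main__":
    gate = ChallengeGate(max_requests=3, window_seconds=10.0)
    # A burst of four requests within the window triggers the challenge.
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, gate.needs_challenge("crawler", t))
```

The trade-off the article describes is visible in the threshold: lowering `max_requests` catches more crawlers but also sends more ordinary users to a verification step.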