Data Type Differentiation and Ownership Analysis in Case Studies(29)

author:YunfangW

Weibo v. iDataAPI: Data Scraping Methods

The plaintiff alleged five infringing acts by the defendant. The core issues in dispute between the parties were whether the defendant captured, stored and sold Weibo data, and what methods it used to capture that data.

The plaintiff's evidence revolved around the following:

  • The defendant set up 11 Weibo API data interfaces on the iDataAPI website it operates, and offered user registration (account and password) together with payment channels such as Alipay;
  • The defendant stated on the iDataAPI website that Weibo data could be obtained through its API interfaces, listed the prices charged for Weibo data, and regularly updated the total number of calls to Weibo data;
  • The plaintiff explained how data is transmitted, exchanged and displayed between the Weibo terminal (user side) and the server side, the related technical processes, the content and types of user data and server data, and the call rules;
  • The plaintiff compared the data returned by the 11 iDataAPI Weibo interfaces with the content of the corresponding pages on the Weibo platform, to prove that what the defendant captured, stored, and sold included not only data publicly displayed on Weibo terminal pages but also data not publicly displayed there;
  • Interface 3 captured the full text of V+ members' paid articles from the Weibo back end. Weibo applies technical protection measures to these paid articles so that only subscribing members can read the full text. The plaintiff used the method of "inserting encrypted strings" to prove that the defendant used malicious technical means to carry out the alleged acts.

The court of first instance held that the relevant facts proved that the defendant had captured, stored, and sold Weibo data through the iDataAPI Weibo data interfaces, and in very large volumes.

On the issue of whether the defendant used malicious technical means to capture, store, and sell Weibo data:

  • The plaintiff claimed that when the defendant captured Weibo data through the Weibo data API interfaces, it avoided detection by maliciously posing as multiple Weibo accounts and rotating its IP addresses.

1) The plaintiff stated the working principle and technical measures related to API call data as follows:

When a user uses the Weibo service, the terminal exchanges data with the server.

To support highly automated, large-scale data transmission, the Weibo server specifies the format and content of the data requests that terminals send to it. When a request meets these requirements, the server returns the corresponding data to the user according to the content of the request.

When the terminal sends a data request to the server, it must provide the location of the Weibo server's data interface (URL: https:/.. ), the terminal device information (User_agent), the IP address, and the uid identification number (Token) that Weibo issues to the logged-in user.

To prevent malicious data scraping, when the Weibo server receives a data request from a terminal, it makes an overall judgment, based on the content of the request and on the frequency and number of requests, as to whether the request most likely comes from a data-scraping computer program or from a real Weibo user.
If the request is highly suspected of coming from a scraping program rather than a real Weibo user, the Weibo server refuses to transmit the requested data to the terminal.
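
The screening described above can be pictured with a minimal Python sketch. It is an illustration only, not Weibo's actual logic: the class and field names, the placeholder URL, and the threshold of 5 requests per second are all assumptions made for the example. The sketch refuses a request when a single (token, IP) identity exceeds the rate limit, which also shows why requests spread across many identities would slip past a per-identity check.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequest:
    """Hypothetical shape of a terminal's data request (fields named in the text)."""
    url: str          # location of the server's data interface
    user_agent: str   # terminal device information
    ip: str           # source IP address
    token: str        # identifier issued to the logged-in user
    timestamp: float  # arrival time in seconds


class RequestScreen:
    """Toy screen: refuse an identity that sends too many requests per window."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 1.0):
        # Both thresholds are assumed values chosen for illustration.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)

    def allow(self, req: DataRequest) -> bool:
        # Content check: a request lacking the required fields is refused.
        if not req.user_agent or not req.token:
            return False
        history = self._history[(req.token, req.ip)]
        # Drop timestamps that have fallen out of the sliding window.
        while history and req.timestamp - history[0] >= self.window_seconds:
            history.popleft()
        # Frequency check: too many recent requests from the same identity.
        if len(history) >= self.max_requests:
            return False
        history.append(req.timestamp)
        return True


def make_requests(n: int, rotate: bool) -> list:
    """Build n rapid requests, optionally with a new IP and token each time."""
    return [
        DataRequest(
            url="https://example.invalid/api",  # placeholder, not a real endpoint
            user_agent="ExampleBrowser/1.0",
            ip=f"10.0.0.{i}" if rotate else "10.0.0.1",
            token=f"uid-{i}" if rotate else "uid-1",
            timestamp=i * 0.05,
        )
        for i in range(n)
    ]


def count_allowed(screen: RequestScreen, requests: list) -> int:
    """Count how many requests the screen lets through."""
    return sum(screen.allow(r) for r in requests)


if __name__ == "__main__":
    print(count_allowed(RequestScreen(), make_requests(10, rotate=False)))
    print(count_allowed(RequestScreen(), make_requests(10, rotate=True)))
```

Under these assumed thresholds, ten rapid requests from one identity are cut off after five, while ten requests that each carry a different IP address and token all pass.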

2) To prove that the defendant used "malicious technical means", the plaintiff adopted the method of "inserting an encrypted string" when collecting evidence:

Each time the iDataAPI website captured Weibo data through iDataAPI Weibo data interface 3, the plaintiff's agent encrypted the IP address, Weibo user uid and other information that the Weibo server received from the iDataAPI website, forming an encrypted string. This string was placed in the data of the V+ members' paid article under the "user_blockchain" parameter.
The parameter and the encrypted string it contains could then be found in the data returned by iDataAPI Weibo interface 3. Decrypting these strings reveals the IP addresses and Weibo user uids that the defendant used to capture Weibo data.
This process was repeated 10 times, and a different IP address and Weibo user uid were obtained after each decryption, which proves that the iDataAPI website changed its IP address and Weibo user account every time it captured Weibo data.
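
The "inserting an encrypted string" method can be sketched in Python as follows. This is a hedged illustration, not the plaintiff's implementation: the source names only the "user_blockchain" parameter, so the function names, the key, and the JSON layout are assumptions, and the toy SHA-256 keystream stands in for whatever cipher was actually used (a real system would use a vetted authenticated cipher).

```python
import base64
import hashlib
import json


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256 in counter mode (illustration only, not secure)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def _xor(data: bytes, key: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def make_marker(key: bytes, ip: str, uid: str) -> str:
    """Encrypt the requester's IP and uid into an opaque string."""
    plain = json.dumps({"ip": ip, "uid": uid}, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(_xor(plain, key)).decode("ascii")


def read_marker(key: bytes, marker: str) -> dict:
    """Decrypt a marker found in captured data back into IP and uid."""
    plain = _xor(base64.urlsafe_b64decode(marker), key)
    return json.loads(plain.decode("utf-8"))


def serve_article(key: bytes, article: dict, ip: str, uid: str) -> dict:
    """Server side: return the article with the marker under 'user_blockchain'."""
    response = dict(article)
    response["user_blockchain"] = make_marker(key, ip, uid)
    return response


def recover_identities(key: bytes, captured: list) -> set:
    """Forensic side: collect the distinct (ip, uid) pairs seen in resold data."""
    identities = set()
    for item in captured:
        info = read_marker(key, item["user_blockchain"])
        identities.add((info["ip"], info["uid"]))
    return identities


if __name__ == "__main__":
    key = b"hypothetical-forensic-key"
    article = {"title": "example", "body": "full text of a paid article"}
    # Simulate 10 captures, each arriving with a different IP and uid.
    captured = [
        serve_article(key, article, f"198.51.100.{i}", f"uid-{i}")
        for i in range(10)
    ]
    print(len(recover_identities(key, captured)))
```

The design point is that the marker travels inside the data itself: whatever the reseller returns to its customer carries a record of the identity the reseller presented to the server for that request.
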
  • The defendant argued at first instance that it did not obtain Weibo data by changing IP addresses and Weibo user accounts, did not circumvent the technical protection measures that Weimeng Company (the plaintiff) had set up for V+ members' paid articles, and did not use any crawler technology to obtain Weibo data. It proposed that Weibo data could also be obtained through the following two methods:

1) Weibo provides anonymous users with an official HTTPS-based API interface for obtaining Weibo data, which programming enthusiasts shared in the article "Collection Related Interfaces" on their personal websites. By combining the interface address in that article with the ID numbers of the 50 V+ paid articles cited in the plaintiff's evidence, the full text of those V+ articles can be viewed. The full text of the latest 10 V+ paid articles on Weibo can be obtained in the same way.

2) Searching by keyword on Baidu also returns an article and a blog post cited by the plaintiff.

  • The court of first instance held that the defendant's defense was untenable, and in addition to the content of the plaintiff's evidence, the following factors were also considered:

1) The enthusiasts' post is dated April 2018, whereas the iDataAPI website shows that its Weibo data was released on December 2, 2016, before the post existed.

2) Because the plaintiff earns revenue from V+ members' paid articles and has set up technical protection measures for them, common sense suggests it is unlikely that the plaintiff would disclose the URLs of such articles to the public. Moreover, the defendant did not prove that the programming enthusiasts had obtained the plaintiff's permission to share the URL of the V+ articles, and the plaintiff maintained that the sharing was illegal.

3) The defendant stated on the iDataAPI website and its WeChat official account that it had made a very large number of calls for Weibo blog posts. The uid identification numbers that the plaintiff assigns to Weibo users and the reference ID numbers of blog posts are unique and not publicly available, yet the defendant provided no evidence of how, using the "website + blog post ID number" method described above, it obtained the ID numbers for such a massive number of blog posts. In addition, the two methods the defendant described are time-consuming, which plainly contradicts its claim to have "advanced data collection, processing and analysis technology".

On appeal, the defendant insisted that it "only provided users with API data interface technical services, and did not capture data, did not have storage conditions, and did not store and sell data".

On the issue of data scraping behavior and specific means, the defendant appealed and asserted:

  • By combining the parameters entered by the user with the Weibo API link published on the open source website, it forms a URL link to access the Weibo server, initiates a data request to the Weibo server, and then transmits the returned data to the user in real time.
  • To substantiate this claim, it submitted evidence that Weibo data can be viewed through the Chrome browser's developer mode.

The court of second instance held that it could be inferred that the defendant must have captured the Weibo data involved in the case from some channel and in some way based on the following circumstances:

  • The evidence on record shows that the Weibo data content provided by the 11 API interfaces in the case covered the data content on the corresponding Weibo webpages, and contained a large amount of content that was not publicly displayed on the webpages;
  • Combined with the fact that the 11 API interfaces are set up under the idataapi.cn domain name, it can be seen that the Weibo data involved in the case must be provided externally through the iDataAPI website server;
  • The plaintiff did not provide the above-mentioned large amount of Weibo data to a third party outside the case.

The court further accepted the plaintiff's claim on the following grounds:

  • The evidence on record is sufficient to prove that the IP address and Weibo user uid differed in each data request sent by the iDataAPI server to the Weibo server. In other words, the defendant disguised its IP address and Weibo user uid in order to deceive the Weibo server into believing that each request came from a different real Weibo user;
  • Although a user browsing Weibo web pages normally with a browser can obtain part of the Weibo data, the user cannot obtain the large amount of data that is not publicly displayed on those pages, and the efficiency of obtaining data this way is clearly not comparable with that of the iDataAPI website;
  • Although more Weibo data (including some data that is not publicly displayed) can be viewed through the Chrome browser's developer mode, the data provided by the iDataAPI website contains multiple fields that cannot be obtained through the browser (such as "reactions", "postCount", and "urank" provided by iDataAPI accused API 1). This method also requires multi-step manual operation, and its efficiency cannot match the 1-10 times/second at which the iDataAPI website provides data;
  • On appeal the defendant cited program code for "collecting Weibo data" published by an unidentified user, along with the two other ways of obtaining Weibo data it had proposed at first instance (URL addresses published on third-party websites, Baidu search engine snapshots, etc.). Although these could yield a certain amount of Weibo data, there was no evidence that the source channels were legal and compliant, and they did not match how the iDataAPI website actually provided Weibo data. They therefore could not constitute a so-called "legitimate source", and they clearly contradicted the defendant's claimed "advanced data acquisition, processing, and analysis technologies".

One further question arises. The plaintiff's evidence that the defendant disguised users and circumvented technical measures concerned only accused interface 3. Can it be inferred that the other 10 interfaces used the same technical means?

  • The plaintiff's explanation as to why only this interface was chosen was:

1) The data requests sent by the iDataAPI server are hidden among the data requests generated by many real Weibo users browsing Weibo normally, and the Weibo server cannot identify them in advance. The plaintiff could therefore only rely on the fact that iDataAPI publicly and directly sells Weibo data, and collect evidence by "inserting special fields";

2) However, inserting abnormal data that neither the server nor the user has preset, over a long period and on a large scale, could disrupt the normal operation of the Weibo service or cause unexpected serious consequences;

3) Interface 3 happens to correspond to the specific, narrowly scoped data interface for "Weibo Headline Article". Balancing the needs of evidence collection against the normal operation of the Weibo platform, the plaintiff selected only this data interface as the example for evidence collection.

  • Based on the following circumstances, the court of second instance can infer that the same technical means are used in all 11 interfaces sued:

1) The plaintiff actively did its best to adduce evidence. At a minimum, it proved that when requesting "Weibo Headline Article" data from the Weibo server, the iDataAPI server rotated IP addresses and Weibo user uids and forged the browser UA information that only real users would use. The defendant could not give a reasonable explanation for this and offered only a general denial. Although its defense put forward other possible channels and methods for obtaining Weibo data, these were inconsistent with the facts of its conduct and could not be reconciled with them.

2) The 11 accused iDataAPI data interfaces all provide data at the same frequency (concurrency: 3 times/second), and the defendant never argued that any one iDataAPI data interface used different technical means from the others. It can therefore be inferred that all 11 accused data interfaces used the same technical means.

Tomorrow, let's look at the controversy related to the storage and sale of data.

Reference: Intellectual Property Treasure WeChat Official Account, released on January 19, 2024 "Competition Case|The First Case Involving Weibo Data Capture Transaction, Guangdong High Court Final Judgment!" (Shenzhen Intermediate People's Court of Guangdong Province (2020) Yue 03 Min Chu No. 4626)