Crawler scenario analysis

Author: VendyZ

Anyone who has scrambled for train tickets before the Spring Festival will be familiar with ticket-grabbing software. Today we look at the technology behind that software: crawlers. In plain terms, a crawler is a program that simulates human browsing, visits websites, and brings back the information it finds. Let's analyze the pros and cons of the main crawler application scenarios.

One: the travel industry

Crawlers are used most widely in the travel industry. Why? Take 12306 as an example. It is the only official website in China that sells train tickets, so buying a train ticket online means going through this site. That has spawned many ticket-grabbing tools. Zhixing Train Tickets, for example, uses crawler technology to refresh the ticket listings on 12306 continuously; once a ticket turns up, it grabs the ticket and prompts you to pay. The benefit is clear: with a few taps we can grab tickets from home. But 12306 itself does not welcome this kind of crawling. High-frequency page views and clicks can bring a website down, and the practice is unfair to those who cannot grab tickets this way. Crawler technology therefore has both advantages and disadvantages for the travel industry.

Two: social platforms

Social platforms are another place crawlers frequent, especially Weibo. Crawlers can collect a person's list of posts, post status, indexes, and more. What is this information good for? Imagine commanding a group of bots at will: they open someone's post, click on an item, and then frantically follow, like, or leave comments. This is the standard "zombie fan" workflow. Through this kind of bulk operation, zombie followers, likes, and comments can be pumped into a Weibo account. Crawler-driven zombie accounts are also used to grab red envelopes on Weibo.

Three: e-commerce platforms

Everyone is familiar with so-called "price comparison platforms", "aggregate e-commerce" sites, and "rebate platforms". Their principle is also an application of crawler technology. When you search for a product, the aggregation platform automatically lays out offers from various e-commerce companies for you to choose from: Taobao, JD.com, Vipshop, and Suning. This is the work of crawlers. They go to Taobao, fetch the image and price of a product, and display them on their own platform. The principle is similar to how search engines work, except that products are displayed instead of web pages. Side-by-side price comparison may be good for consumers, but many e-commerce platforms do not see it that way. E-commerce platforms have their own way of fighting crawlers, the "web application firewall". Anti-crawler techniques are not discussed here.

Four: search engines

As we all know, search engines decide which page ranks first, and one of the main metrics is how often each search result is clicked. One black-hat SEO method is to use crawlers to inflate click traffic: search for a specific keyword, then click the target link in the results over and over, and the site's weight in the search engine will rise. But this approach is wrong, and it is an example of crawlers being put to harmful use. No search engine can allow outsiders to tamper with its results, or it would lose credibility. Baidu therefore adjusts its algorithm from time to time to combat black-hat SEO. Once a website is caught, its ranking weight is reduced, and the losses outweigh the gains. In general, crawler technology has advantages and disadvantages. It depends on how you use it.

Analyzing the pros and cons of these scenarios shows that crawler technology is a double-edged sword. The technology itself is not guilty; what matters is how the people who use it choose to use it. Crawling public information on the network is not illegal, but using crawler technology to steal private information for profit is absolutely not acceptable. In short, everyone must use crawler technology within the limits the law permits.

Anti-crawler strategies

Where there are crawlers, there are anti-crawlers. Some websites hold sensitive data that they do not want you to obtain, so the companies behind them adopt various anti-crawling measures.

1. Block IP

This is a simple and blunt approach: find the accounts that send too many requests per unit of time, look up the IP address of the computer behind each account, and block that computer's access outright. The rate of false positives is relatively high, however, so use it with caution.
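The "too many requests per unit time" check can be sketched as a sliding-window counter. This is a minimal illustration, not how any particular site implements blocking; the class name `SlidingWindowLimiter`, its parameters, and the thresholds are assumptions made for the example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical helper: refuses a client that exceeds max_requests
    within the last window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id (an IP or account name) -> timestamps of accepted requests
        self._hits = defaultdict(deque)

    def allow(self, client_id, now):
        """Return True if the request is accepted, False if it should be blocked."""
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
decisions = [limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0)]
```

Because many users can share one IP address (for example behind a company or campus gateway), a rule like this can block innocent visitors, which is the false-positive risk mentioned above.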

2. Replace sensitive information with images

Product price information on e-commerce platforms is sensitive, so some platforms display prices and model numbers as images instead of text. This does hinder crawlers, but as machine learning advances, image recognition keeps getting stronger, and the measure is gradually becoming less effective.

3. What you see on the web page is not what you get

Using certain algorithmic rules, fake information is mapped to the real information. The fake information is what is stored in the page source, but at display time the rules and a TTF font file map it back to the real information, so the visitor sees the real values while a crawler reading the source sees the fake ones.
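The mapping idea can be sketched with a plain dictionary standing in for the font file. This is only an illustration under assumptions: the table `GLYPH_MAP` and both function names are invented for the example, and a real site would encode the mapping inside a custom TTF font rather than in code like this.

```python
# Hypothetical mapping that a custom font would encode: the character stored
# in the page source (key) is drawn with the glyph of the real character (value).
GLYPH_MAP = {
    "7": "1", "3": "2", "9": "3", "1": "4", "0": "5",
    "4": "6", "8": "7", "2": "8", "6": "9", "5": "0",
}


def obfuscate(real_text, glyph_map=GLYPH_MAP):
    """Turn the real text into what the server would store in the page source."""
    inverse = {real: stored for stored, real in glyph_map.items()}
    return "".join(inverse.get(ch, ch) for ch in real_text)


def rendered_text(stored_text, glyph_map=GLYPH_MAP):
    """Reproduce what a visitor sees once the font has been applied."""
    return "".join(glyph_map.get(ch, ch) for ch in stored_text)


stored = obfuscate("128.50")    # what a crawler reads from the HTML
shown = rendered_text(stored)   # what the browser displays with the font
```

A crawler that reads only the HTML gets `stored`; it recovers the real price only if it also obtains and decodes the mapping.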

4. Manually enter dynamic codes

Some websites guard against crawling by requiring you to enter a dynamic code before you can view a page; the code verifies your identity and expires after a set time.
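One way to build a code that "has an expiration date" is to sign the expiry time together with the session it belongs to. The sketch below is an assumption about how such a check could work, not a description of any specific site; `SECRET_KEY`, `issue_code`, `verify_code`, and the 60-second lifetime are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-secret"  # hypothetical server-side secret


def _sign(session_id, expires_at):
    message = f"{session_id}:{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_code(session_id, now, ttl_seconds=60):
    """Create a code bound to one session that expires ttl_seconds after now."""
    expires_at = int(now) + ttl_seconds
    return f"{expires_at}:{_sign(session_id, expires_at)}"


def verify_code(session_id, code, now):
    """Accept the code only if it is well formed, unexpired, and correctly signed."""
    try:
        expires_text, signature = code.split(":")
        expires_at = int(expires_text)
    except ValueError:
        return False  # malformed code
    if now > expires_at:
        return False  # expired
    expected = _sign(session_id, expires_at)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(signature.encode(), expected.encode())


code = issue_code("session-1", now=1000)
```

In practice the code would usually be delivered through a separate channel such as SMS, or shown as a CAPTCHA, so that an automated client cannot simply read it from the response.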

5. Legal Channels

Are crawlers illegal? At present crawlers still sit in something of a legal gray area, but lawsuits over crawling do occur, and protecting data through legal channels is one more option.

Legality of crawlers:

As the saying goes: "Crawl with glee and you may sit in prison for good; play games with data and you will get your fill of prison food."

Status of legal norms related to web crawlers:

At the legal level, China has relevant provisions. When crawlers infringe on personal privacy, judicial practice often punishes the conduct as the crime of infringing on citizens' personal information under Article 253-1 of the Criminal Law; when they only infringe on network data, the conduct is usually governed by Articles 285 and 286 of the Criminal Law. There are shortcomings, however. New internet technologies develop so quickly that legislation struggles to keep pace, so existing laws and regulations tend to lag behind and be conservative. In addition, it is inherently hard to judge whether a given use of a web crawler is a crime, and judicial practice has no clear standard for drawing the line, which can leave judges at a loss when facing such cases.

Whether a crawler is illegal or not depends on the circumstances.

Legitimate crawlers:

The legitimate use of web crawlers requires attention to the following points: 1. when crawling public data, the crawler should fetch only data that can be accessed without identity verification; 2. crawling must not affect the normal operation of other people's servers; 3. crawling must not affect the normal business of others. Used normally, web crawlers do not cross legal red lines. The technology is highly practical: it breaks down information barriers, makes it far easier for internet users to obtain information, and brings commercial organizations large commercial benefits and development opportunities. When web crawlers are used legally, the benefits outweigh the drawbacks.

Illegal crawlers:

1. Malicious crawling of users' personal data, which may infringe citizens' personal privacy. China's Cybersecurity Law and Criminal Law both contain provisions protecting citizens' personal information. When crawlers maliciously collect personal information, judicial practice often applies Articles 41 and 44 of the Cybersecurity Law and Article 253-1 of the Criminal Law, and the conduct may be punished as the crime of infringing on citizens' personal information.

(Article 41 of the Cybersecurity Law: The collection and use of personal information shall follow the principles of legality, propriety, and necessity, disclose the rules for collection and use, clearly indicate the purpose, method, and scope of the collection and use of information, and obtain the consent of the person being collected.

Network operators must not collect personal information unrelated to the services they provide, must not collect or use personal information in violation of the provisions of laws, administrative regulations, and the agreements of both parties, and shall handle the personal information they retain in accordance with the provisions of laws, administrative regulations and agreements with users.

Article 44 of the Cybersecurity Law: No individual or organization may steal or otherwise illegally obtain personal information.

Article 253-1 of the Criminal Law [Crime of Infringing on Citizens' Personal Information] Where relevant state provisions are violated by selling or providing citizens' personal information to others, and the circumstances are serious, they shall be sentenced to fixed-term imprisonment of not more than three years or criminal detention, and shall be fined concurrently or alone; If the circumstances are particularly serious, they shall be sentenced to fixed-term imprisonment of not less than three years and not more than seven years, and shall also be fined.

Where relevant state provisions are violated by selling or providing to others citizens' personal information obtained in the course of performing duties or providing services, a heavier punishment is to be given in accordance with the provisions of the preceding paragraph.

Where citizens' personal information is stolen or illegally obtained by other methods, punishment is to be given in accordance with the provisions of paragraph 1.

Where a unit commits the crimes mentioned in the preceding three paragraphs, it shall be fined and its directly responsible managers and other directly responsible personnel shall be punished in accordance with the provisions of those paragraphs.)

2. Crawling where the page prohibits it, or crawling without authorization. Crawling that deliberately evades or forcibly breaks through the anti-crawler technical measures of a website or app without authorization is "unauthorized" access to or acquisition of data, and the perpetrator bears the corresponding liability, including criminal liability, under the law. Under Articles 285 and 286 of China's Criminal Law, breaking through technical barriers to intrude into another party's computer system and obtain data in that system may involve the crime of illegally intruding into a computer information system, the crime of illegally obtaining computer information system data, and the crime of destroying a computer information system.

(Criminal Law, Article 285 [Crime of Illegally Intruding into Computer Information Systems] Whoever violates state regulations by intruding into computer information systems in the fields of state affairs, national defense construction, or cutting-edge science and technology shall be sentenced to fixed-term imprisonment of not more than three years or criminal detention.

[Crimes of Illegally Obtaining Computer Information System Data or Illegally Controlling Computer Information Systems] Where state provisions are violated by intruding into computer information systems other than those provided for in the preceding paragraph, or by using other technical means, to obtain data stored, processed, or transmitted in that computer information system, or to exercise illegal control over that computer information system, and the circumstances are serious, the offender shall be sentenced to fixed-term imprisonment of not more than three years or criminal detention, and shall be fined concurrently or alone; if the circumstances are particularly serious, the offender shall be sentenced to fixed-term imprisonment of not less than three years and not more than seven years, and shall also be fined.

[Crime of Providing Programs or Tools for Intruding into or Illegally Controlling Computer Information Systems] Where programs or tools specifically for intruding into or illegally controlling computer information systems are provided, or where programs or tools are provided to others with knowledge that they are committing unlawful or criminal acts of intruding into or illegally controlling computer information systems, and the circumstances are serious, punishment is to be given in accordance with the provisions of the preceding paragraph.

Where a unit commits the crimes mentioned in the preceding three paragraphs, it shall be fined and its directly responsible managers and other directly responsible personnel shall be punished in accordance with the provisions of those paragraphs.

Article 286 of the Criminal Law [Crime of Destroying Computer Information Systems] Where a person deletes, modifies, adds, or interferes with the functions of a computer information system in violation of state regulations, causing the computer information system to fail to operate normally, and the consequences are serious, he shall be sentenced to fixed-term imprisonment of not more than five years or criminal detention; If the consequences are particularly serious, they shall be sentenced to fixed-term imprisonment of not less than five years.

Where state provisions are violated by deleting, modifying, or adding to data and application programs stored, processed, or transmitted in computer information systems, and the consequences are serious, punishment is to be given in accordance with the provisions of the preceding paragraph.

Where destructive programs such as computer viruses are intentionally produced or disseminated, affecting the normal operation of computer systems, and the consequences are serious, punishment is to be given in accordance with the provisions of paragraph 1.)

3. Affecting business or servers, or crawling website and app data beyond the specified amount. The rules also address the amount of information a crawler collects, the number of visits it makes, and crawlers that affect the normal operation of websites; these situations are governed by Article 16 of the Data Security Management Measures. (Article 16 of the Data Security Management Measures is the first time the state has explicitly regulated crawlers.)

(Article 16 of the Data Security Management Measures: The use of automated means to access and collect website data must not obstruct the normal operation of the website. Where such behavior seriously affects the operation of the website, for example where automated access and collection traffic exceeds one-third of the website's average daily traffic, and the website requests that the automated access and collection stop, it shall stop.)
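The one-third threshold quoted above is simple arithmetic. The sketch below only illustrates that arithmetic and is not a compliance test: the function name and the choice to measure traffic in request counts are assumptions, and the article text does not say how traffic should be measured.

```python
def exceeds_one_third(automated_requests, average_daily_requests):
    """Return True if automated traffic is more than one-third of the
    site's average daily traffic. Both arguments are request counts."""
    if average_daily_requests <= 0:
        raise ValueError("average_daily_requests must be positive")
    # Compare 3 * a > b instead of a / b > 1/3 to avoid floating-point rounding.
    return 3 * automated_requests > average_daily_requests


over_limit = exceeds_one_third(400, 1000)    # 40% of daily traffic
under_limit = exceeds_one_third(300, 1000)   # 30% of daily traffic
```

For a site averaging 1,000 requests a day, a crawler would cross the line at 334 requests, since 333 is still just under one-third.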

In addition, watch out for the following:

The site clearly states that crawling is not allowed.

To check, append /robots.txt to the domain name and view the file.

Crawling that causes problems similar to a DDoS attack.

In robots.txt, Disallow marks paths that crawlers must not fetch, and Allow marks paths that they may.

Not every website has a robots.txt saying whether crawlers are allowed; in that case the judgment is up to you.

Being able to crawl data does not make crawling it legitimate; careful judgment is required.
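Checking Disallow and Allow rules before fetching can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the user agent `demo-crawler`, and the `example.com` URLs below are made up for illustration; a real crawler would load the file from the target site instead of from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might appear at <domain>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(ROBOTS_TXT, "demo-crawler", "https://example.com/private/data")
permitted = is_allowed(ROBOTS_TXT, "demo-crawler", "https://example.com/public/page")
```

Passing this check only means the site has not asked crawlers to stay away from that path; as noted above, it does not by itself make the crawl lawful.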

Tip: Even when crawling is illegal, a company will not necessarily report it to the police right away. It will usually respond with anti-crawling measures first and call the police only when the situation becomes serious.

Limitations of the robots protocol:

Besides the law, there is the robots protocol, a technical convention in the industry. It is not a law, nor a contract in the legal sense, but an agreement that originated unofficially rather than as a standard issued by a standards body. The protocol has no binding force. It can only serve as a reminder and cannot monitor or block violations by crawlers, so compliance depends on crawler operators following it voluntarily. Its content is also incomplete, so it cannot reasonably and effectively regulate the various problems that arise in crawler use, and in practice there are cases where a company's crawler complies with the protocol yet still violates the law. Relying on the robots protocol alone is therefore not enough to keep the use of web crawlers lawful.

Solutions:

Measures are needed to keep the use of web crawler technology lawful. First, the scope of use should be limited further: crawlers should be restricted to information that has been made public on the internet; their use must not affect the normal operation of the target website; personal information they collect must not be disclosed without consent; and the purpose of a crawl should be made clear before it begins. Web crawler technology is a double-edged sword, and how it is used determines its value, so delimiting its lawful use is a feasible approach.

(For details, see the paper "The Evolution of Web Crawlers and Their Legality Limits": To reflect and preserve the neutrality of technology, the legality of web crawlers needs to be defined in law, that is, the boundaries of the lawful use of crawler technology need to be drawn. We believe that web crawlers can be legally qualified in the following three respects. First, as to the objects crawled, web crawlers should target public data. Second, as to the means or methods of data scraping, web crawlers should not be intrusive; whether they are intrusive should be judged on two points: whether the technology itself is intrusive, and whether the crawling complies with the crawler protocol and contractual agreements. Third, as to the parties who develop and use crawler technology, "legitimacy of purpose" should be required. These three qualifications are sufficient conditions for judging the legitimacy of web crawlers, that is, only data crawling that meets all three conditions is legal. Conversely, if any of these conditions is not met, the data scraping can be considered illegal. ......)

Second, legislation should be strengthened. The Cybersecurity Law and the Criminal Law should be developed further on the protection of personal privacy and on the ownership of data in the internet era. Special offices and expert groups could also be set up to adjust crawler-related legislation promptly as technology develops. As noted above, technology advances quickly, legislation struggles to keep pace, and existing laws and regulations tend to lag behind and be conservative. These problems can be eased by having expert groups update and interpret laws and regulations frequently, and by publicizing typical cases and studying them collectively.

In addition, a sound and reasonable reporting mechanism should be established to encourage the reporting of illegal crawling, with more supervision and reporting channels such as dedicated websites, offices, and mailboxes. Given China's current conditions, a dedicated technical department could be set up to supervise crawlers on the internet and work together with the procuratorate, each complementing the other: technical staff find and deal with illegal crawlers on the network, while the procuratorate supervises the department itself to prevent corruption within it.

Real-name authentication could also be required of individuals and enterprises that use web crawler technology: the technology may not be used without authentication, and unauthenticated use is severely punished once discovered. Judicial and law enforcement personnel could also be organized regularly to study the relevant laws and regulations.