A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).
Web search engines and some other sites use Web crawling or spidering software to update their web content or indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so users can search more efficiently.
Crawlers consume resources on visited systems and often visit sites without approval. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent. For instance, including a robots.txt file can request bots to index only parts of a website, or nothing at all.
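The robots.txt check can also be performed programmatically. The following is a minimal sketch using Python's standard-library urllib.robotparser; the agent name "ExampleBot" and the URLs are illustrative assumptions, not taken from the article.

    # Minimal sketch: consult a site's robots.txt before fetching a page.
    # The agent name and URLs below are illustrative assumptions.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # download and parse the site's robots.txt

    if rp.can_fetch("ExampleBot", "https://example.com/private/report.html"):
        print("Allowed to crawl this URL")
    else:
        print("robots.txt asks crawlers to skip this URL")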
The number of Internet pages is extremely large; even the largest crawlers fall short of making a complete index. For this reason, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000. Today relevant results are given almost instantly.
Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping (see also data-driven programming).
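To illustrate hyperlink validation, the sketch below uses only the Python standard library: it collects every anchor href on a page and reports whether each resolved link returns an HTTP status code or an error. The starting URL is an illustrative assumption.

    # Minimal hyperlink-validation sketch (illustrative starting URL).
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError

    class LinkCollector(HTMLParser):
        """Collect the href value of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    base = "https://example.com/"  # illustrative starting page
    collector = LinkCollector()
    collector.feed(urlopen(base).read().decode("utf-8", errors="replace"))

    for href in collector.links:
        url = urljoin(base, href)  # resolve relative links against the page URL
        try:
            result = urlopen(url).getcode()  # e.g. 200 if the link resolves
        except (HTTPError, URLError) as exc:
            result = exc                     # broken or unreachable link
        print(url, result)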
A Web crawler may also be called a Web spider,[1] an ant, an automatic indexer,[2] or (in the FOAF software context) a Web scutter.[3]
Crawlers can retrieve data much more quickly and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. If a single crawler performs multiple requests per second or downloads large files, a server can struggle to keep up, especially when it faces requests from multiple crawlers at once.
As noted by Koster, Web crawlers are useful for a number of tasks, but they come with a price for the general community.[32] The costs of using Web crawlers include the following (see the politeness sketch after this list):
network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time;
server overload, especially if the frequency of accesses to a given server is too high;
poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle; and
personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.
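One common way to limit these costs is a per-host "politeness" delay between successive requests to the same server. The sketch below is a generic illustration of that idea, not the method of any particular crawler; the delay value and seed URLs are assumptions.

    # Generic "politeness" sketch: wait a fixed delay between successive
    # requests to the same host. The delay and URLs are assumptions.
    import time
    from urllib.parse import urlparse
    from urllib.request import urlopen

    CRAWL_DELAY = 2.0   # seconds between requests to one host (assumed value)
    last_request = {}   # host -> timestamp of the most recent request

    def polite_fetch(url):
        host = urlparse(url).netloc
        wait = CRAWL_DELAY - (time.time() - last_request.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)  # respect the per-host delay
        last_request[host] = time.time()
        return urlopen(url).read()

    for url in ["https://example.com/", "https://example.com/about"]:
        polite_fetch(url)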
Open-source crawlers
Frontera is a web crawling framework implementing a crawl frontier component and providing scalability primitives for web crawler applications.
GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
GRUB is an open source distributed search crawler that Wikia Search used to crawl the web.
Heritrix is the Internet Archive's archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in Java.
ht://Dig includes a Web crawler in its indexing engine.
HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
mnoGoSearch is a crawler, indexer, and search engine written in C and licensed under the GPL (*NIX machines only).
news-please is an integrated crawler and information extractor specifically written for news articles under the Apache License. It supports crawling and extraction of full websites (by recursively traversing all links or the sitemap) and single articles.[61]
Apache Nutch is a highly extensible and scalable web crawler written in Java and released under an Apache License. It is based on Apache Hadoop and can be used with Apache Solr or Elasticsearch.
Open Search Server is a search engine and web crawler software released under the GPL.
PHP-Crawler is a simple PHP and MySQL based crawler released under the BSD License.
Scrapy, an open source web crawler framework, written in Python (licensed under the BSD license); a minimal spider sketch follows this list.
Seeks, a free distributed search engine (licensed under AGPL).
Sphinx (search engine), a free search crawler, written in C++.
StormCrawler, a collection of resources for building low-latency, scalable web crawlers on Apache Storm (Apache License).
tkWWW Robot, a crawler based on the tkWWW web browser (licensed under GPL).
Xapian, a search crawler engine, written in C++.
YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).
Octoparse, a free client-side Windows web crawler written in .NET.
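As a concrete illustration of one entry above, a Scrapy spider is a Python class with a name, a list of start URLs, and a parse callback that yields extracted items and follow-up requests. The sketch below is minimal and illustrative; the spider name, start URL, and CSS selectors are assumptions, not drawn from the article.

    # Minimal Scrapy spider sketch (spider name, start URL, and selectors
    # are illustrative assumptions).
    # Run without a full project via:  scrapy runspider example_spider.py
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # Emit the page title, then follow every link on the page.
            yield {"url": response.url, "title": response.css("title::text").get()}
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)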