Create your own large-scale search engine, similar to Google and Bing (and potentially just as large), by combining our highly configurable webcrawler with Elasticsearch. Index and search billions of web pages. Our software supports crawling and indexing specific domains or the entire web, as well as building niche or topic-specific search engines. You have full control over where the webcrawler crawls and what it indexes.
GUI App for Windows 7, 8, 8.1, and 10.
Cross-platform headless under development.
Sample web app included.
Download Search Engine Software
Freeware vs Licensed
Below are the limitations of the freeware version. This list will grow as more features are implemented.
- Freeware version is limited to 1 (one) crawler thread. Crawler threads increase how many pages can be visited and indexed concurrently.
- Freeware version doesn't support using MySQL for the webcrawler database.
- Freeware version is limited to 100,000 indexed web pages. Licensed version imposes no limitation (index without limits).
Pricing & Buying
ScraperHut Search Engine Software is available for just $380.00, which includes all features, updates, upgrades, and support for the life of your ScraperHut account.
Our Search Engine Software Solution
This product is designed to create a Google or Bing-like search engine. It crawls the web, indexing only the web pages you care about. These indexed web pages can then be searched, providing a similar search result page to that of Google or Bing. Our search engine software consists of two parts:
- Webcrawler - This GUI application crawls the web, keeping only the pages you care about. This is an extremely feature-rich webcrawler, giving you ultimate control. The webcrawler uses either the built-in Firebird database or MySQL for storing information it needs, such as the URLs it's visited, URLs in the queue to be crawled, etc.
- Search Engine - This is where Elasticsearch comes in. While crawling (or after), parsed web pages are added to your Elasticsearch database. This is your search index, which provides the Google-like search results. Elasticsearch is not required, but without it you'd have to implement your own search solution.
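As a rough illustration of the second part, the sketch below shows how parsed pages could be packaged for Elasticsearch's bulk API, which accepts newline-delimited JSON with one action line followed by one document line. The index name `webpages`, the field names (`url`, `title`, `body`), and the helper `build_bulk_body` are assumptions made for this example; they are not taken from the webcrawler itself, which may structure its documents differently.

```python
import json


def build_bulk_body(pages, index_name="webpages"):
    """Return an NDJSON string suitable for Elasticsearch's _bulk endpoint.

    `index_name` and the document fields are hypothetical choices.
    """
    lines = []
    for page in pages:
        # Action line: index this document, using the URL as its ID so a
        # re-crawled page replaces its earlier copy instead of duplicating it.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": page["url"]}}))
        # Source line: the document to store and search.
        lines.append(json.dumps({
            "url": page["url"],
            "title": page["title"],
            "body": page["body"],
        }))
    # The bulk API requires the request body to end with a newline.
    return "\n".join(lines) + "\n"


pages = [
    {"url": "https://example.com/", "title": "Example", "body": "Example page text."},
    {"url": "https://example.com/about", "title": "About", "body": "About this site."},
]
bulk_body = build_bulk_body(pages)
```

The resulting string would be sent as the body of a POST to the `_bulk` endpoint with the `application/x-ndjson` content type. Elasticsearch limits document IDs to 512 bytes, so very long URLs would need to be hashed first.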
Search Engine Software: Our Webcrawler
Our webcrawler is extremely configurable and multi-threaded. You have complete control over which links are followed and indexed, which links are blocked using patterns, what content pages must contain to be indexed, how fresh web pages must be, and much, much more. The last equivalent product, Innerprise ES.NET 2004, was created in 2004 by the same developer.
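To make the idea of pattern-based control concrete, here is a minimal sketch of how follow patterns, block patterns, and required content could combine into a single indexing decision. It uses regular expressions and a hypothetical `CrawlRules` class; the webcrawler's actual pattern syntax and settings are not described here, so treat every name below as an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CrawlRules:
    """Hypothetical rule set; not the webcrawler's real configuration."""
    follow_patterns: list = field(default_factory=list)  # URL must match one, if any are given
    block_patterns: list = field(default_factory=list)   # URL must match none of these
    required_terms: list = field(default_factory=list)   # page text must contain all of these

    def url_allowed(self, url):
        # Block patterns take priority over follow patterns.
        if any(re.search(p, url) for p in self.block_patterns):
            return False
        if self.follow_patterns:
            return any(re.search(p, url) for p in self.follow_patterns)
        return True

    def should_index(self, url, text):
        if not self.url_allowed(url):
            return False
        lowered = text.lower()
        # Case-insensitive check that every required term appears.
        return all(term.lower() in lowered for term in self.required_terms)


rules = CrawlRules(
    follow_patterns=[r"^https?://([^/]+\.)?example\.com/"],
    block_patterns=[r"/login", r"\.pdf$"],
    required_terms=["search engine"],
)
```

With these rules, `rules.should_index("https://example.com/a", "Build a Search Engine today")` returns `True`, while any URL containing `/login` is rejected before its content is even examined.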
Create Your Own Search Engine
Why create your own search engine? Below are just some of the reasons:
- By using our search engine software, you have full control over your own search engine. You aren't dependent on Google or Bing to crawl and index web pages, or subject to their API limitations or pricing. Crawl and index what you want, when you want.
- Google has become quite biased with its search results. It's possible to create a large-scale search engine using our software, up to billions of pages. In the late 1990s and early 2000s there were quite a few major search engines that crawled the web and had their own indexes. Web search wasn't controlled by two companies like it is today. You may remember some of the names: Lycos, Infoseek, Excite, WebCrawler, and AltaVista, to name a few.
- With recent privacy-abuse revelations coming to light, now is the optimal time for new search engines to be born, giving people back viable choices of places to search.
Your Own Large-Scale Search Engine
Creating a massive search engine is an undertaking that doesn't happen overnight. Elasticsearch can scale to indexing billions of documents, and our embedded database (Firebird) supports up to 18TB of data and billions of rows. Firebird, though, is intended for a quick start. Our search engine software also supports MySQL (or MariaDB), which is the best option for large-scale web crawling. Indexing billions of pages takes time, but because the majority of pages never change or change only rarely, the webcrawler can focus most of its time on discovering new pages. By using MySQL (or MariaDB), you can run our webcrawler on more than one machine while working with the same data.
Distributed Web Crawling
To crawl the most web pages in the shortest amount of time, you'll want to install our webcrawler on multiple machines and use MySQL (or MariaDB) as your database. This gives each webcrawler its own resources while working with the same set of data. Alternatively, you may consider having a separate webcrawling machine per TLD (for example, .com or .net), each with its own database. This helps eliminate disk I/O bottlenecks. All of the data gets fed back into Elasticsearch.
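The per-TLD arrangement can be pictured as a simple partitioning step over seed URLs. The sketch below is only an illustration: `tld_of` and `partition_by_tld` are hypothetical helpers, and the TLD is naively taken as the last label of the hostname, which ignores multi-part suffixes such as .co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def tld_of(url):
    """Return the last hostname label (naive TLD), or "" if there is none."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def partition_by_tld(urls, assigned_tlds):
    """Group URLs by TLD; anything without a dedicated machine goes to "other"."""
    buckets = defaultdict(list)
    for url in urls:
        tld = tld_of(url)
        buckets[tld if tld in assigned_tlds else "other"].append(url)
    return dict(buckets)


seed_urls = [
    "https://example.com/",
    "https://example.net/news",
    "https://example.org/",
    "https://shop.example.com/items",
]
partitions = partition_by_tld(seed_urls, assigned_tlds={"com", "net"})
```

Each bucket would then be handed to the machine responsible for that TLD, so every crawler writes to its own database while all of them feed the same Elasticsearch index.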
Search Engine Software Tools
When you download our search engine software, you receive the following:
- Our custom webcrawler, which feeds Elasticsearch with pages.
- Sample web applications to search and display results from the data, in PHP and .NET.
Elasticsearch itself can be downloaded for free and is not currently bundled with our software. The webcrawler can be used without it, but will be unable to provide searching capabilities (you would have to provide your own).
Elasticsearch is open-source software and was created for searching through massive amounts of data quickly. Verizon searches more than 500 billion documents using Elasticsearch. The features of Elasticsearch make it perfect for creating Google-like search engines.
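For a sense of what a search front end sends to Elasticsearch, the sketch below builds a request body for the `_search` endpoint using a `multi_match` query, `from`/`size` pagination, and highlighting. It is written in Python for illustration only (the bundled samples are in PHP and .NET), and the field names `title` and `body` and the helper `build_search_request` are assumptions, not the sample apps' actual code.

```python
def build_search_request(user_query, page=1, page_size=10):
    """Build an Elasticsearch _search body for one page of results.

    Field names are hypothetical; adjust them to match the real index mapping.
    """
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                # "^2" boosts matches in the title over matches in the body.
                "fields": ["title^2", "body"],
            }
        },
        # Pagination: skip the results shown on earlier pages.
        "from": (page - 1) * page_size,
        "size": page_size,
        # Ask for highlighted snippets from the body for the result page.
        "highlight": {"fields": {"body": {}}},
    }


request_body = build_search_request("open source crawler", page=3)
```

Serialized as JSON and posted to the index's `_search` endpoint, a body like this returns ranked hits along with highlighted fragments that can be rendered as result snippets.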
Search Engine Software Pricing
Our search engine software is available in both a Freeware edition and a Licensed edition. The Licensed edition is available for just $380.00 USD per machine it's installed on, which includes priority support via our support forum and private messaging, as well as free updates and upgrades for the life of your ScraperHut account.
Below are some of the differences between the Freeware edition and Licensed edition:
- The Licensed version supports using MySQL (or MariaDB) for the webcrawler database. This is recommended when crawling massive amounts of web pages. The Freeware edition only supports the built-in Firebird and LiteDB databases.
- The Freeware version is limited to a single webcrawling thread. Essentially, this affects the number of web pages the webcrawler can visit at any given time. With a single thread, the webcrawler can visit only one web page at a time. With 20 threads, the webcrawler can visit 20 pages at a time. The Licensed version imposes no limit of its own, so you'll want to experiment with the setting to find the optimal number. A low-end machine (Pentium G4600) with an SSD can crawl over 1 million web pages per day with 15 crawler threads and a 100 Mbps Internet connection.
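The throughput figure above can be broken down with a little arithmetic. The sketch below only restates the quoted numbers (1 million pages per day, 15 threads); the function name `crawl_rates` is made up for the example, and real rates will vary with page size, server response times, and politeness delays.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_rates(pages_per_day, threads):
    """Return (pages/second overall, pages/second per thread, seconds per page per thread)."""
    pages_per_second = pages_per_day / SECONDS_PER_DAY
    per_thread = pages_per_second / threads
    seconds_per_page = 1 / per_thread
    return pages_per_second, per_thread, seconds_per_page


total_rate, thread_rate, seconds_per_page = crawl_rates(1_000_000, 15)
print(f"{total_rate:.2f} pages/s overall")       # 11.57 pages/s overall
print(f"{thread_rate:.2f} pages/s per thread")   # 0.77 pages/s per thread
print(f"{seconds_per_page:.2f} s per page")      # 1.30 s per page

# Days for one machine to crawl a billion pages at the same rate.
days_for_billion = 1_000_000_000 / 1_000_000
```

In other words, each thread needs to finish a page roughly every 1.3 seconds, and a single machine at this rate would take about 1,000 days to reach a billion pages, which is why crawling across multiple machines matters at that scale.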
Neither the Freeware nor the Licensed edition imposes a crawling or page limit. Crawl and index as many pages as you like.