Scraper Overview

How?

The DocSearch scraper is written in Python and heavily inspired by the Scrapy framework. It goes through all pages of your website and extracts content from the HTML structure to populate an Algolia index.

It automatically follows every internal link to make sure we are not missing any content, and uses the semantics of your HTML structure to construct its records. This means that headings (h1, h2, etc.) are used as the hierarchy, and each p is used as a potential result.
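To make the idea concrete, here is a minimal sketch of how headings and paragraphs could be turned into records. It is not the scraper's actual code: the `RecordBuilder` class, the `build_records` helper, and the record shape (`hierarchy` and `content` keys) are assumptions made for illustration, and it relies only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class RecordBuilder(HTMLParser):
    """Hypothetical helper: one record per <p>, tagged with the headings above it."""

    def __init__(self):
        super().__init__()
        self.hierarchy = {}  # e.g. {"h1": "Guide", "h2": "Install"}
        self.records = []
        self._tag = None     # heading or <p> whose text is being collected
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "p":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # closing an inline tag such as <code> or <a>
        text = " ".join("".join(self._buf).split())
        if tag in HEADINGS:
            level = HEADINGS.index(tag)
            # A new heading discards any same-level or deeper headings seen earlier.
            self.hierarchy = {
                h: t for h, t in self.hierarchy.items()
                if HEADINGS.index(h) < level
            }
            self.hierarchy[tag] = text
        elif text:
            self.records.append(
                {"hierarchy": dict(self.hierarchy), "content": text}
            )
        self._tag = None


def build_records(html):
    """Return a list of records extracted from an HTML string."""
    parser = RecordBuilder()
    parser.feed(html)
    parser.close()
    return parser.records
```

Feeding this a page with an h1, an h2, and a paragraph under each would yield one record per paragraph, each carrying the headings that were in effect when the paragraph appeared.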

Those CSS selectors can be overridden, and each website has its own JSON configuration file that describes in more detail how the scraper should behave. You can find the complete list of options in the related section.
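As a rough illustration, a configuration that overrides the default selectors might look like the one generated below. The key names (`index_name`, `start_urls`, `selectors`, `lvl0` to `lvl2`, `text`) and all values are assumptions for this sketch; the list of options remains the authority on what the scraper accepts.

```python
import json

# Hypothetical configuration: every key name and value here is illustrative.
config = {
    "index_name": "example",
    "start_urls": ["https://www.example.com/docs/"],
    "selectors": {
        # Headings scoped to the main content area form the hierarchy.
        "lvl0": ".content h1",
        "lvl1": ".content h2",
        "lvl2": ".content h3",
        # Paragraphs and list items become potential results.
        "text": ".content p, .content li",
    },
}

# Serialize to the JSON form a configuration file would hold.
config_json = json.dumps(config, indent=2)
print(config_json)
```

Scoping each selector to a container such as `.content` is one way to keep navigation and footer text out of the index.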

When?

We automatically run each config every 24 hours. This is done from our own infrastructure, meaning that you don't need to install anything on your side. We run this service entirely free of charge, but we ask that you keep the "Search by Algolia" logo next to the search results.

That being said, if you'd like to run DocSearch on your own, all the code is open source and is also packaged as a Docker image. Download it and run it with your own credentials.