
Crawl Workflow Settings

One of the key features of Browsertrix is the ability to refine crawler settings to the exact specifications of your crawl and website.

Changes to a setting will only apply to subsequent crawls.

Crawl settings are shown in the crawl workflow detail Settings tab and in the archived item Crawl Settings tab.

Crawl Scope

Specify the range and depth of your crawl. Different settings will be shown depending on whether you chose Page List or Site Crawl when creating a new workflow.

Crawling with HTTP basic auth

Both Page List and Site Crawl workflows support HTTP Basic Auth, which can be provided as part of the URL, for example: https://username:password@example.com.

These credentials WILL BE WRITTEN into the archive. We recommend exercising caution: only archive with dedicated archival accounts, and change the password or delete the account when finished.
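
Passwords that contain characters such as @, : or / need to be percent-encoded before they are placed in a URL. The following Python sketch shows one way to build such a URL. The function name with_basic_auth and the example credentials are hypothetical, and the sketch assumes the crawler parses credentials the way standard URL libraries do.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_basic_auth(url, username, password):
    """Return url with percent-encoded credentials added before the host."""
    parts = urlsplit(url)
    # safe="" forces reserved characters such as "@", ":" and "/" to be encoded
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Hypothetical dedicated archival account
print(with_basic_auth("https://example.com/private/", "archive-bot", "p@ss:word/1"))
# https://archive-bot:p%40ss%3Aword%2F1@example.com/private/
```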

Crawl Type: Page List

Page URL(s)

A list of one or more URLs that the crawler should visit and capture.

Include Any Linked Page

When enabled, the crawler will visit all the links it finds within each page defined in the Page URL(s) field.

Crawling tags & search queries with Page List crawls

This setting can be useful for crawling the content of specific tags or search queries. Specify the tag or search query URL(s) in the Page URL(s) field, e.g. https://example.com/search?q=tag, and enable Include Any Linked Page to crawl all the content present on that search query page.

Fail Crawl on Failed URL

When enabled, the crawler will fail the entire crawl if any of the provided URLs are invalid or cannot be crawled successfully. The resulting archived item will have a status of "Failed".

Crawl Type: Site Crawl

Crawl Start URL

This is the first page that the crawler will visit. It's important to set a Crawl Start URL that accurately represents the scope of the pages you wish to crawl, as the Start URL Scope selection will depend on this field's contents.

You must specify the protocol (likely http:// or https://) as a part of the URL entered into this field.

Start URL Scope

Hashtag Links Only

This scope will ignore links that lead to other addresses such as example.com/path and will instead instruct the crawler to visit hashtag links such as example.com/#linkedsection.

This scope can be useful for crawling certain web apps that may not use unique URLs for their pages.

Pages in the Same Directory

This scope will only crawl pages in the same directory as the Crawl Start URL. If example.com/path is set as the Crawl Start URL, example.com/path/path2 will be crawled but example.com/path3 will not.

Pages on This Domain

This scope will crawl all pages on the domain entered as the Crawl Start URL; however, it will ignore subdomains such as subdomain.example.com.

Pages on This Domain and Subdomains

This scope will crawl all pages on the domain and any subdomains found. If example.com is set as the Crawl Start URL, pages on both example.com and subdomain.example.com will be crawled.

Custom Page Prefix

This scope will crawl all pages whose URLs begin with the Crawl Start URL, as well as any pages whose URLs begin with one of the prefixes listed in Extra URL Prefixes in Scope.
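
The scope rules above can be sketched as a small Python function. This is an illustration of the behavior described in this section, not the crawler's actual implementation. The function name in_scope and the scope labels are hypothetical, Hashtag Links Only is left out, and URLs are assumed to include the protocol.

```python
from urllib.parse import urlsplit


def in_scope(url, start_url, scope, extra_prefixes=()):
    """Return True if url falls inside the given scope relative to start_url."""
    u, s = urlsplit(url), urlsplit(start_url)
    if scope == "same-directory":
        base = s.path.rstrip("/")
        # /path matches /path and /path/path2, but not /path3
        return u.hostname == s.hostname and (
            u.path == base or u.path.startswith(base + "/")
        )
    if scope == "domain":
        return u.hostname == s.hostname
    if scope == "domain-and-subdomains":
        host = u.hostname or ""
        return host == s.hostname or host.endswith("." + s.hostname)
    if scope == "custom-prefix":
        return any(url.startswith(p) for p in (start_url, *extra_prefixes))
    raise ValueError(f"unknown scope: {scope}")


start = "https://example.com/path"
print(in_scope("https://example.com/path/path2", start, "same-directory"))  # True
print(in_scope("https://example.com/path3", start, "same-directory"))       # False
print(in_scope("https://subdomain.example.com/", start, "domain"))          # False
```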

Max Depth

Only shown with a Start URL Scope of Pages on This Domain and above, the Max Depth setting instructs the crawler to stop visiting new links past a specified depth.
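
As a rough illustration of a depth limit, the Python sketch below walks a small in-memory link graph breadth-first. It assumes depth is counted in link hops from the Crawl Start URL, with the start page at depth 0. The crawl_order function and the sample link graph are hypothetical and not taken from the crawler.

```python
from collections import deque


def crawl_order(links, start, max_depth):
    """Breadth-first walk over a link graph, stopping max_depth hops from start."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == max_depth:
            continue  # links found on pages at the depth limit are not queued
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


links = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
}
print(crawl_order(links, "/", max_depth=1))  # ['/', '/a', '/b']
print(crawl_order(links, "/", max_depth=2))  # ['/', '/a', '/b', '/a/1']
```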

Extra URL Prefixes in Scope

Only shown with a Start URL Scope of Custom Page Prefix, this field accepts additional URLs or domains that will be crawled if URLs that lead to them are found.

This can be useful for crawling websites that span multiple domains, such as example.org and example.net.

Include Any Linked Page ("one hop out")

When enabled, the crawler will visit all the links it finds within each page, regardless of the Start URL Scope setting.

This can be useful for capturing links on a page that lead outside the website that is being crawled but should still be included in the archive for context.

Check For Sitemap

When enabled, the crawler will check for a sitemap at /sitemap.xml and use it to discover pages to crawl if found. It will not crawl pages found in the sitemap that do not meet the crawl's scope settings or limits.

This can be useful for discovering and capturing pages on a website that aren't linked to from the seed and which might not otherwise be captured.
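
The idea of reading a sitemap and keeping only the pages that meet the crawl's scope and limits can be sketched in Python as follows. This is a simplified illustration that handles a single urlset document passed in as text. The function names, the sample sitemap, and the in_scope callback are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value listed in a sitemap <urlset>."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def pages_to_queue(xml_text, in_scope, max_pages):
    """Keep only in-scope sitemap URLs, up to the page limit."""
    queued = [url for url in sitemap_urls(xml_text) if in_scope(url)]
    return queued[:max_pages]


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first-post</loc></url>
  <url><loc>https://example.com/blog/orphan-page</loc></url>
  <url><loc>https://other.example.net/elsewhere</loc></url>
</urlset>"""

same_site = lambda url: url.startswith("https://example.com/")
print(pages_to_queue(SAMPLE, same_site, max_pages=10))
# ['https://example.com/blog/first-post', 'https://example.com/blog/orphan-page']
```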

Exclusions

The exclusions table will instruct the crawler to ignore links it finds on pages where all or part of the link matches an exclusion found in the table. The table is only available in Page List crawls when Include Any Linked Page is enabled.

This can be useful for avoiding crawler traps — sites that may automatically generate pages such as calendars or filter options — or other pages that should not be crawled according to their URL.

Matches text

Will perform simple matching of entered text and exclude all URLs where matching text is found.

e.g. If about is entered, example.com/aboutme/ will not be crawled.

Regex

Regular expressions (Regex) can also be used to perform more complex matching.

e.g. If \babout\/?\b is entered, example.com/about/ will not be crawled; however, example.com/aboutme/ will be crawled.
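
The difference between the two exclusion types can be checked with a short Python sketch. It uses Python's re module, whose syntax may differ in places from the regex engine the crawler uses, and the is_excluded function is a hypothetical stand-in for the exclusions table.

```python
import re


def is_excluded(url, text_rules=(), regex_rules=()):
    """Return True if url matches any plain-text or regex exclusion."""
    # "Matches text": exclude when the text appears anywhere in the URL
    if any(text in url for text in text_rules):
        return True
    # "Regex": exclude when the pattern matches anywhere in the URL
    return any(re.search(pattern, url) for pattern in regex_rules)


print(is_excluded("example.com/aboutme/", text_rules=["about"]))            # True
print(is_excluded("example.com/about/", regex_rules=[r"\babout\/?\b"]))     # True
print(is_excluded("example.com/aboutme/", regex_rules=[r"\babout\/?\b"]))   # False
```

The regex version leaves example.com/aboutme/ in the crawl because \b requires a word boundary directly after about, and the letter m that follows is part of the same word.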

Limits

Enforce maximum limits on your crawl.

Max Pages

Adds a hard limit on the number of pages that will be crawled. The crawl will be gracefully stopped after this limit is reached.

Crawl Time Limit

The crawl will be gracefully stopped after this set period of elapsed time.

Crawl Size Limit

The crawl will be gracefully stopped after reaching this set size in GB.

Page Load Timeout

Limits the amount of time to wait for a page to load. Behaviors will run after this timeout only if the page is partially or fully loaded.

Delay After Page Load

Waits on the page after the initial HTML page load for a set number of seconds before moving on to next steps such as link extraction and behaviors. This can be useful with pages that are slow to load their contents.

Behavior Timeout

Limits the amount of time behaviors have to complete.

Auto Scroll Behavior

When enabled, the browser will automatically scroll to the end of the page.

Delay Before Next Page

Waits on the page for a set period of time after any behaviors have finished running. This can help avoid rate limiting, but it will slow down your crawl.
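
To get a feel for how the per-page timing settings add up, the Python sketch below computes a rough worst-case estimate. It assumes the four stages run one after another and that every page uses its full allowance. It ignores link extraction and other overhead, and the example values are hypothetical rather than defaults.

```python
def max_seconds_per_page(page_load_timeout, delay_after_load,
                         behavior_timeout, delay_before_next):
    """Worst-case seconds spent on one page if every stage uses its full time."""
    return page_load_timeout + delay_after_load + behavior_timeout + delay_before_next


def rough_crawl_upper_bound(pages, browser_windows, **timings):
    """Worst-case crawl duration in seconds with pages split across windows."""
    per_page = max_seconds_per_page(**timings)
    batches = -(-pages // browser_windows)  # ceiling division
    return batches * per_page


timings = dict(page_load_timeout=90, delay_after_load=0,
               behavior_timeout=300, delay_before_next=0)
print(max_seconds_per_page(**timings))              # 390
print(rough_crawl_upper_bound(100, 4, **timings))   # 9750
```

Most pages finish well before their timeouts, so a real crawl will usually be much faster than this bound. The estimate is mainly useful for picking a sensible Crawl Time Limit.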

Browser Settings

Configure the browser used to visit URLs during the crawl.

Browser Profile

Sets the Browser Profile to be used for this crawl.

Browser Windows

Sets the number of browser windows that are used to visit webpages while crawling. Increasing the number of browser windows will speed up crawls by capturing more pages in parallel.

There are some trade-offs:

  • This may result in a higher chance of getting rate limited due to the increase in traffic sent to the website.
  • More execution minutes will be used per-crawl.

Crawler Release Channel

Sets the release channel of Browsertrix Crawler to be used for this crawl. Crawls started by this workflow will use the latest crawler version from the selected release channel. Generally, "Default" will be the most stable; other channels may have newer features (or bugs)!

This setting will only be shown if multiple different release channels are available for use.

Block Ads by Domain

Will prevent any content from the domains listed in Steven Black's Unified Hosts file (ads & malware) from being captured by the crawler.

User Agent

Sets the browser's user agent in outgoing requests to the specified value. If left blank, the crawler will use the Brave browser's default user agent. For a list of common user agents see useragents.me.

Using custom user agents to get around restrictions

Although doing so is against best practices, some websites block specific browsers based on their user agent: a string of text that browsers send to web servers to identify the type of browser or operating system requesting content. If Brave is blocked, using the user agent string of a different browser (such as Chrome or Firefox) may be sufficient to convince the website that a different browser is being used.

User agents can also be used to voluntarily identify your crawling activity, which can be useful when working with a website's owners to ensure crawls can be completed successfully. We recommend using a user agent string similar to the following, replacing the orgname and URL comment with your own:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 orgname.browsertrix (+https://example.com/crawling-explanation-page)

If you have no webpage that identifies your organization or explains your crawling activities, omit the parenthesized comment at the end entirely.

This string must be provided to the website's owner so they can allowlist Browsertrix to prevent it from being blocked.
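
If you manage several workflows, assembling the identifying user agent with a small helper keeps it consistent. The Python sketch below follows the pattern recommended above. The identifying_user_agent function is hypothetical, and the browser portion shown is only an example Chrome string.

```python
def identifying_user_agent(browser_ua, orgname, info_url=None):
    """Append an organization token and an optional info URL to a browser UA."""
    ua = f"{browser_ua} {orgname}.browsertrix"
    if info_url:
        # Only add the comment when there is a page to link to
        ua += f" (+{info_url})"
    return ua


CHROME_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
)

print(identifying_user_agent(CHROME_UA, "orgname", "https://example.com/crawling-explanation-page"))
print(identifying_user_agent(CHROME_UA, "orgname"))
```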

Language

Sets the browser's language setting. Useful for crawling websites that detect the browser's language setting and serve content accordingly.

Scheduling

Automatically start crawls periodically on a daily, weekly, or monthly schedule.

Tip: Scheduling crawl workflows with logged-in browser profiles

Some websites will log users out after a set period of time. When crawling with a custom browser profile that is logged into a website, we recommend checking the profile before crawling to ensure it is still logged in.

Logouts can cause issues with scheduled crawl workflows, which will run even if the selected browser profile has been logged out.

Crawl Schedule Type

Run Immediately on Save

When selected, the crawl will run immediately as configured. It will not run again unless manually instructed.

Run on a Recurring Basis

When selected, additional configuration options for instructing the system when to run the crawl will be shown. If a crawl is already running when the schedule is set to activate it, the scheduled crawl will not run.

No Schedule

When selected, the configuration options that have been set will be saved but the system will not do anything with them unless manually instructed.

Frequency

Set how often a scheduled crawl will run.

Day

Sets the day of the week on which crawls scheduled with a Weekly Frequency will run.

Date

Sets the date of the month on which crawls scheduled with a Monthly Frequency will run.

Start Time

Sets the time that the scheduled crawl will start according to your current timezone.

Also Run a Crawl Immediately On Save

When enabled, a crawl will run immediately on save as if the Run Immediately on Save Crawl Schedule Type was selected, in addition to scheduling a crawl to run according to the above settings.

Metadata

Describe and organize your crawl workflow and the resulting archived items.

Name

Allows a custom name to be set for the workflow. If no name is set, the workflow's name will be set to the Crawl Start URL. For Page List crawls, the workflow's name will be set to the first URL present in the Page URL(s) field, with an added (+x) where x represents the total number of URLs in the list.

Description

Leave optional notes about the workflow's configuration.

Tags

Apply tags to the workflow. Tags applied to the workflow will propagate to every crawl created with it at the time of crawl creation.

Collection Auto-Add

Search for and specify collections that this crawl workflow should automatically add archived items to as soon as crawling finishes. Canceled and Failed crawls will not be added to collections.