Configure Sites¶
Websites are configured through a temporary browser that is embedded directly in the Browsertrix interface. Every website that is visited using the embedded browser is added to the list of Saved Sites. When the embedded browser session ends, personalized data from the sites are collected into a profile. This profile of preconfigured sites can then be saved and used by multiple crawl workflows.
The embedded browser is used during the process of creating a new browser profile and when editing an existing profile.
Use Cases¶
Website Sign In¶
To crawl content as a logged-in user, load the website you intend to archive in the embedded browser and sign in as you would in any other browser. Once you are signed in, confirm that the login worked by opening a page on the site that the crawler should have access to. You may need to log in again periodically, as websites may log users out after a certain period of time.
Tip: Crawl regularly to stay logged in
Regularly running crawl workflows that use a browser profile can help to reduce the frequency with which logouts occur on some websites. Data such as cookies and sessions may be refreshed during crawling, and Browsertrix will automatically update the browser profile with this data when each crawl successfully finishes.
Hide Popups¶
Load the website you intend to archive in the embedded browser and accept or otherwise dismiss the popup. If the developers of the website have built the site so that the result of your interaction is saved, the popup should remain hidden at crawl time. To confirm this, exit the embedded browser session and then load the site again.
Customize the Crawling Browser¶
The embedded browser used to configure profiles is the same browser behind Browsertrix’s high-fidelity crawls. This enables advanced use cases like using a browser profile to customize the browser at crawl time. To view all available browser settings, load any site in the profile and then navigate to brave://settings in the embedded browser.
Advanced Use Case: Proceed with caution
Customizing the crawler browser is for advanced use cases and it is not generally recommended to change these settings. We offer crawl-time browser customization like ad blocking and language in workflow settings. Changing browser settings directly in the profile may result in conflicting settings that are difficult to troubleshoot. If using this advanced feature, we recommend adding clear metadata to the browser profile that describes the change.
Example: Blocking page resources with Brave's Shields
Whereas the crawler's scoping settings define which pages are crawled, Brave's Shields feature can block individual resources on those pages from loading. By default, Shields blocks the resources on EasyList's cookie list, and it can be set to apply a number of other included lists under Brave Settings > Shields > Filter Lists.
Custom Filters can also be useful for blocking sites with resources that aren't included in one of the known block lists.
The uBlock Origin filter syntax can be used for finer control over which in-page resources are blocked.
All of the browser's ad blocking and privacy features can be used in combination with the Block Ads by Domain crawler setting.
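As a rough illustration of the custom filter syntax mentioned above, the sketch below shows a few uBlock Origin-style rules of the kind that could be entered under Custom Filters. The hostnames, path, and CSS class are hypothetical placeholders, and Brave may not support every uBlock Origin option, so treat this as a starting point and check the result in the embedded browser.

```
! Lines starting with "!" are comments.

! Block every request to a (hypothetical) host and its subdomains.
||tracker.example.com^

! Block only scripts served from a (hypothetical) path on another host.
||cdn.example.net/widgets/*$script

! Cosmetic filter: hide a (hypothetical) banner element on one site.
example.org##.newsletter-banner
```

Network rules such as the first two stop the resource from being requested at all, while a cosmetic rule only hides an element that has already loaded.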
Saving the Profile¶
After you are done interacting with the embedded browser, press Save Profile (or Create Profile for new browser profiles).
Saved Sites¶
All sites that are loaded in the embedded browser and then saved will appear in the Saved Sites list. Select a site in the list to view or reconfigure the site in the embedded browser.
Load New URL¶
You may want to load a URL that is not listed in the Saved Sites to preview how a page may appear to the crawler, or to add a new site. Due to the nature of the embedded browser, it can be difficult to navigate between different websites if there are no hyperlinks between them. The easiest way to load a new URL is to press Load New URL from the browser profile page and enter the URL.
Although browser profiles have no limit on the number of saved sites, we recommend one site per browser profile to make troubleshooting crawls easier. An exception is when using a URL List workflow to crawl multiple websites that require a profile, as we only allow one browser profile per workflow.