Intro to Crawl Workflows
Crawl workflows are the bread and butter of automated browser-based crawling. A crawl workflow enables you to specify how and what the crawler should capture on a website.
A finished crawl results in an archived item that can be downloaded and shared. To easily identify and find archived items within your org, you can automatically name and tag archived items through custom workflow metadata.
You can create, view, search for, and run crawl workflows from the Crawling page.
Create a Crawl Workflow
Create new crawl workflows from the Crawling page, or from the Create New ... shortcut on the Dashboard.
Choose what to crawl
The first step in creating a new crawl workflow is to choose what you'd like to crawl by defining a Crawl Scope. Crawl scopes are categorized as a Page Crawl or Site Crawl.
Page Crawl
Choose one of these crawl scopes if you know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.
A Page Crawl workflow is simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.
Site Crawl
Choose one of these crawl scopes to have the crawler automatically find pages based on a domain name, start page URL, or directory on a website.
Site Crawl workflows are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.
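To make the idea of a scope concrete, here is a minimal sketch of how a crawler could decide whether a discovered link belongs to a domain-based or directory-based scope. This is an illustration only, not Browsertrix's actual scoping logic: the function name `in_scope` and the mode names `"domain"` and `"directory"` are hypothetical, and real scoping rules are configured through the workflow settings.

```python
from urllib.parse import urlsplit


def in_scope(url: str, seed: str, mode: str) -> bool:
    """Illustrative scope check (hypothetical, not Browsertrix's implementation)."""
    u, s = urlsplit(url), urlsplit(seed)
    if mode == "domain":
        # Any page on the same host as the seed URL is in scope.
        return u.hostname == s.hostname
    if mode == "directory":
        # Same host, and the path sits under the seed's directory.
        directory = s.path.rsplit("/", 1)[0] + "/"
        return u.hostname == s.hostname and u.path.startswith(directory)
    raise ValueError(f"unknown scope mode: {mode}")


seed = "https://example.com/docs/intro.html"
print(in_scope("https://example.com/docs/guide/a.html", seed, "directory"))
print(in_scope("https://example.com/blog/post.html", seed, "directory"))
print(in_scope("https://example.com/blog/post.html", seed, "domain"))
```

With the seed above, the directory scope covers everything under `/docs/`, while the domain scope also admits the blog page on the same host.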
After deciding what type of crawl you'd like to run, you can begin to set up your workflow. A detailed breakdown of available settings can be found in the workflow settings guide.
Run Crawl
Run a crawl workflow by clicking Run Crawl in the actions menu of the workflow in the crawl workflow list, or by clicking the Run Crawl button on the workflow's details page.
While crawling, the Latest Crawl section streams the current state of the browser windows as they visit pages. You can modify the crawl live by adding URL exclusions or changing the number of crawling instances.
Re-running a crawl workflow can be useful to capture a website as it changes over time, or to run with an updated crawl scope.
Workflow Status
The status of a crawl workflow updates as the workflow runs, in response to user intervention, or automatically when certain org-wide limits are reached.
Statuses may be displayed with a reason that details how the current status came to be.
| Status | Description |
|---|---|
| Waiting for Resources | The workflow is queued to run and is waiting for the computational resources needed to start the crawl. |
| Waiting: Reason | The workflow run is queued for one of the following reasons. At Crawl Limit: the org has reached its maximum number of concurrent crawls. Dedupe Index: an update to the deduplication index is in progress. |
| Starting | The crawler is being initialized. Crawling will begin shortly. |
| Running | The crawler is visiting and archiving pages. |
| Pausing | The crawler has been instructed to pause and is finishing the crawl of the current page. |
| Pausing (Finishing Downloads) | The crawler is finalizing downloads on the current page. |
| Pausing (Creating WACZ) | Pages crawled so far are being packaged into WACZ files and transferred to storage. |
| Paused | The workflow run has been paused by a user. It can be resumed for up to 7 days; afterwards, the run stops. |
| Paused: Reason | The workflow run has been paused automatically due to an enforced limit, as specified in the reason. |
| Resuming | The workflow run is starting back up after being paused. |
| Stopping | The crawler has been instructed to stop and is finishing the crawl of the current page. |
| Finishing Downloads | The crawler is waiting for the current page to finish downloading to finalize the crawl. |
| Generating WACZ | Crawled pages are being packaged into WACZ files. |
| Uploading WACZ | WACZ files have been created and are being transferred to storage. |
| Complete | All pages within the workflow's scope and limits have been crawled and saved as WACZ, resulting in an archived item. |
| Stopped | The workflow run was stopped by a user and allowed to finish gracefully, resulting in an archived item. |
| Stopped: Paused Too Long | The workflow run was stopped automatically because it was not resumed within the given time limit. |
| Stopped: Reason | The workflow run was stopped automatically due to an enforced limit, as specified in the reason. |
| Canceled | The workflow run was canceled by a user; crawled content is discarded. |
| Skipped: Reason | The workflow run was skipped due to an enforced limit, as specified in the reason. |
| Failed | A serious error occurred while crawling, causing the crawler to exit; no crawled content is saved. |
| Failed: Not Logged In | The crawler detected a logged-out page and failed the crawl, per the Fail Crawl if Not Logged In setting. |
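If you handle these status labels in your own scripts, the "Status: Reason" convention from the table can be split into its two parts. The sketch below assumes the labels are available to you as plain strings exactly as displayed; `parse_status`, `saves_content`, and `SAVES_CONTENT` are hypothetical names, not part of any Browsertrix API. The outcome lookup records only what the table states explicitly and returns `None` for everything else.

```python
from typing import NamedTuple, Optional


class StatusLabel(NamedTuple):
    status: str
    reason: Optional[str]


def parse_status(label: str) -> StatusLabel:
    """Split a displayed label such as 'Paused: Storage Quota Reached'."""
    status, sep, reason = label.partition(":")
    return StatusLabel(status.strip(), reason.strip() if sep else None)


# Outcomes stated explicitly in the table above; other statuses are omitted
# on purpose rather than guessed.
SAVES_CONTENT = {
    "Complete": True,   # results in an archived item
    "Stopped": True,    # finishes gracefully, results in an archived item
    "Canceled": False,  # crawled content is discarded
    "Failed": False,    # no crawled content is saved
}


def saves_content(label: str) -> Optional[bool]:
    """Return True/False where the table says so, otherwise None."""
    parsed = parse_status(label)
    if parsed.reason is not None:
        # The table does not spell out the outcome for 'Status: Reason' variants.
        return None
    return SAVES_CONTENT.get(parsed.status)


print(parse_status("Paused: Storage Quota Reached"))
print(saves_content("Canceled"))
```

Labels that use parentheses instead of a colon, such as Pausing (Creating WACZ), pass through unchanged as a status with no reason.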
Enforced Limit Reasons
Workflow runs may be automatically paused, stopped, or skipped due to an enforced quota or limit. The status will always be displayed with a reason:
| Reason | Description |
|---|---|
| Storage Quota Reached | Disk space allocated for the org is full. |
| Time Quota Reached | All execution time allocated for the org has been spent. |
| Crawling Disabled | Crawling has been disabled for the entire org. |
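As a small sketch of how the reasons above combine with the paused, stopped, and skipped statuses, the following helper checks whether a displayed label reflects an enforced limit. It assumes labels are plain strings in the "Status: Reason" form shown in this guide; `enforced_limit_reason` and the two constant sets are hypothetical names used for illustration.

```python
from typing import Optional

# Reasons listed in the table above.
ENFORCED_LIMIT_REASONS = {
    "Storage Quota Reached",
    "Time Quota Reached",
    "Crawling Disabled",
}

# Statuses that this guide says can result from an enforced limit.
LIMITED_STATUSES = {"Paused", "Stopped", "Skipped"}


def enforced_limit_reason(label: str) -> Optional[str]:
    """Return the enforced limit reason in a label, or None if there is none."""
    status, sep, reason = label.partition(":")
    reason = reason.strip()
    if sep and status.strip() in LIMITED_STATUSES and reason in ENFORCED_LIMIT_REASONS:
        return reason
    return None


print(enforced_limit_reason("Skipped: Time Quota Reached"))
print(enforced_limit_reason("Stopped: Paused Too Long"))
```

Reasons outside the table, such as Paused Too Long or At Crawl Limit, are deliberately not treated as enforced limit reasons here.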