Intro to Archived Items¶
Archived items consist of one or more WACZ files created by a crawl workflow or uploaded to Browsertrix. They can be individually replayed, or combined with other archived items in a collection. The Archived Items page lists all items in the organization.
Uploading Web Archives¶
WACZ files can be given metadata and uploaded to Browsertrix by pressing the Upload WACZ button on the archived items list page. Only one WACZ file can be uploaded at a time.
Status¶
The status of an archived item depends on its type. Uploads will always have the status Uploaded, while crawls can have one of the following statuses:
| Status | Description |
|---|---|
| Complete | All pages within the crawl workflow's scope and limits have been crawled and are included in the archived item. |
| Stopped | Only the pages crawled before the workflow run was stopped are included in the archived item. An optional reason for stopping may be displayed. |
Archived Item Details¶
The archived item details page is composed of the following sections, though some are only available for crawls and not uploads.
Overview¶
View metadata and statistics associated with how the archived item was created.
To edit the item's description, tags, and the collections it is associated with, press the pencil icon at the top right of the metadata section.
Each archived item also has a unique identifier (ID) that can be used to reference the item in the Browsertrix API and in support requests. The ID may be prefixed with upload-, manual-, or sched-, followed by letters and numbers. The ID can be copied from the archived item Overview section or the Actions dropdown menu. Crawled items will also have an option to copy the item's workflow ID, which similarly can be used to reference the item’s workflow in the API and support requests.
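When handling item IDs in scripts, it can help to separate the optional prefix from the rest of the ID. The following is a minimal sketch that relies only on the three prefixes listed above. The function name `split_item_id` and the sample IDs are made up for illustration and are not part of the Browsertrix API.

```python
from typing import Optional, Tuple

# Prefixes named in the text above; an ID may also carry no prefix at all.
KNOWN_PREFIXES = ("upload-", "manual-", "sched-")


def split_item_id(item_id: str) -> Tuple[Optional[str], str]:
    """Split an archived item ID into (prefix, remainder).

    The prefix is returned without its trailing hyphen, or as None
    when the ID does not start with one of the known prefixes.
    """
    for prefix in KNOWN_PREFIXES:
        if item_id.startswith(prefix):
            return prefix.rstrip("-"), item_id[len(prefix):]
    return None, item_id


# Hypothetical IDs, for illustration only.
print(split_item_id("upload-abc123"))
print(split_item_id("abc123"))
```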
Quality Assurance¶
View crawl quality information collected from analysis runs, review crawled pages, and start new analysis runs. QA is available only for crawls, and only to org members with crawler permissions.
The pages list provides a record of all pages within the archived item, as well as any ratings or notes given to the page during review. If analysis has been run, clicking on a page in the pages list will go to that page in the review interface.
Crawl Analysis¶
Running crawl analysis will re-visit all pages within the archived item, comparing the data collected during analysis with the data collected during crawling. Crawl analysis runs with the same workflow limit settings used during crawling.
Crawl analysis can be run multiple times, though results should only differ if the crawler version has been updated between runs. The analysis process is being constantly improved and future analysis runs should produce better results. Analysis run data can be downloaded or deleted from the Analysis Runs tab. While they are stored as WACZ files, analysis run WACZs only contain analysis data and may not open correctly or be useful in other programs that replay archived content.
Once a crawl has been analyzed, either fully or partially, it can be reviewed by pressing the Review Crawl button. For more on reviewing crawls and interpreting analysis data, see Crawl Review.
Paid Feature
Like running a crawl workflow, running crawl analysis also uses execution time. Crawls and crawl analysis share the same concurrent crawling limit, but crawl analysis runs will be paused in favor of new crawls if the concurrent crawling limit is reached.
Replay¶
View a high-fidelity replay of the website at the time it was archived.
For more details on navigating web archives within ReplayWeb.page, see the ReplayWeb.page user documentation.
WACZ Files¶
View downloadable files to save the archived item to a local device or to export the item from Browsertrix for use in another system.
One or more WACZ files may be present depending on the size and scale of the crawl. To combine them all into a single WACZ file, choose Export as Combined WACZ at the top of the tab. The combined file will automatically begin downloading.
Combining multiple WACZ files is the default behavior when choosing Download and Download Item from the archived item and workflow action menus, respectively.
Archived items created using deduplication may depend on data in other items for high-fidelity replay. To download such an item as a combined WACZ that includes all of those dependencies, click the dropdown menu next to Export as Combined WACZ and select With Dependencies.
What is WACZ?
WACZ is a media type that allows web archive collections to be packaged and shared on the web as a discrete file. A WACZ file includes the data that is needed for the rendering of archived content as well as contextual information.
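Because a WACZ file is a ZIP archive, a downloaded item can be inspected with general-purpose tools. The sketch below assumes the layout described in the WACZ specification, where a `datapackage.json` file at the root of the archive holds the contextual metadata. The function name `summarize_wacz` and the file name `my-item.wacz` are placeholders.

```python
import json
import zipfile


def summarize_wacz(path_or_file):
    """Return (member_names, metadata) for a WACZ file.

    metadata is the parsed datapackage.json, or None if the archive
    does not contain one at its root.
    """
    with zipfile.ZipFile(path_or_file) as wacz:
        names = wacz.namelist()
        metadata = None
        if "datapackage.json" in names:
            with wacz.open("datapackage.json") as fh:
                metadata = json.load(fh)
    return names, metadata


# Example usage with a placeholder file name:
# names, metadata = summarize_wacz("my-item.wacz")
# for name in names:
#     print(name)
```

This only lists and reads files inside the archive. Replaying the archived content still requires a replay tool such as ReplayWeb.page.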
Logs¶
View a list of errors and behavior logs that were generated during crawling. Clicking a log entry in the list will reveal additional information.
Only a subset of the logs generated by the crawler are visible in this tab. All log entries that were recorded in the creation of the archived item can be downloaded in JSONL format by pressing the Download All Logs button.
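A downloaded JSONL log file holds one JSON object per line, so it can be processed line by line. The sketch below tallies entries by a single field. The default field name `logLevel` and the file name `crawl-logs.jsonl` are assumptions, so check a few lines of the downloaded file and adjust them to match.

```python
import json
from collections import Counter


def count_by_field(lines, field="logLevel"):
    """Count JSONL log entries grouped by the value of one field.

    `field` defaults to "logLevel", which is an assumed key name.
    Entries that lack the field are counted under "unknown".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        counts[entry.get(field, "unknown")] += 1
    return counts


# Example usage with a placeholder file name:
# with open("crawl-logs.jsonl", encoding="utf-8") as fh:
#     for value, total in count_by_field(fh).most_common():
#         print(value, total)
```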
Crawl Settings¶
View the crawl workflow configuration options that were used to generate the resulting archived item. Many of these settings also apply when running crawl analysis.