Web Curator Tool

About Web Curator Tool

The Web Curator Tool (WCT) is an application for harvesting web material, such as websites, web pages, and other documents published on the internet.

Web Harvesting and WCT

The National Library of New Zealand runs a selective web harvesting programme using the Web Curator Tool. Websites harvested by this method are deposited into the Library’s digital archive (Rosetta) and are then available to researchers via our main delivery channels.

The tool enables a user to enter descriptive and administrative metadata for a website, schedule and run a web crawl on that site, and review the archived content. The collected web material is then stored and preserved in the digital archive.


Figure: Web harvest into preservation system

The Web Curator Tool was developed in 2006 as a collaborative effort by the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium (IIPC). WCT is available under the terms of the Apache License 2.0. It is written in Java and designed to run in Apache Tomcat. It has a flexible architecture that allows the components of the tool to be distributed over multiple servers.

The Web Curator Tool was released as open-source software and can be downloaded from GitHub: http://dia-nz.github.io/webcurator/

It is designed for use in libraries and other collecting organisations. It supports collection by non-technical users while still allowing complete control of the web harvesting process.

The Web Curator Tool supports:

  • Harvest authorisation - obtaining permission to harvest web material and make it accessible;
  • Selection, scoping and scheduling - deciding what to harvest, how, and when;
  • Basic description - adding unqualified Dublin Core metadata and web-specific notes;
  • Harvesting - downloading the selected material from the internet;
  • Quality review - ensuring the harvested material is ready to archive; and
  • Archiving - submitting harvest results to a digital archive.

Archiving to Rosetta

When archiving a harvest, the WCT:

  1. Packages the web harvest in a SIP (Submission Information Package) structure along with a METS XML file.
  2. Authenticates with Rosetta via the PDS login to obtain a PDS handle.
  3. Transfers the files to a secure location via FTP.
  4. Makes a deposit web service call to Rosetta to submit the SIP (including the FTP folder).
  5. Receives the SIP ID that Rosetta returns on success.
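
The order of the five archiving steps above can be illustrated with a minimal Java sketch. It is not the WCT source code and does not use the real Rosetta API: the interfaces PdsLogin, SipTransfer and DepositService, the class ArchiveWorkflow, and the file name "mets.xml" are hypothetical stand-ins chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the PDS login; returns a PDS handle. */
interface PdsLogin {
    String login(String user, String password);
}

/** Hypothetical stand-in for the FTP transfer; returns the remote folder name. */
interface SipTransfer {
    String upload(List<String> files);
}

/** Hypothetical stand-in for the deposit web service; returns the SIP ID. */
interface DepositService {
    String deposit(String pdsHandle, String ftpFolder);
}

/** Illustrative workflow that runs the archiving steps in order. */
final class ArchiveWorkflow {
    private final PdsLogin pds;
    private final SipTransfer transfer;
    private final DepositService deposits;

    ArchiveWorkflow(PdsLogin pds, SipTransfer transfer, DepositService deposits) {
        this.pds = pds;
        this.transfer = transfer;
        this.deposits = deposits;
    }

    /** Archives the given harvest files and returns the SIP ID. */
    String archive(List<String> harvestFiles, String user, String password) {
        // Step 1: package the harvest together with a METS XML file.
        // (Assumption: the package is modelled here as a plain list of file names.)
        List<String> sip = new ArrayList<>(harvestFiles);
        sip.add("mets.xml");

        // Step 2: authenticate to obtain a PDS handle.
        String handle = pds.login(user, password);

        // Step 3: transfer the packaged files to the secure location.
        String ftpFolder = transfer.upload(sip);

        // Step 4: make the deposit call, passing the handle and the FTP folder.
        String sipId = deposits.deposit(handle, ftpFolder);

        // Step 5: a successful deposit yields a SIP ID.
        if (sipId == null || sipId.isEmpty()) {
            throw new IllegalStateException("Deposit did not return a SIP ID");
        }
        return sipId;
    }
}
```

Because each collaborator is a single-method interface, the workflow can be exercised with lambdas in place of real services, which keeps the sequencing logic separate from any login, FTP, or web service details.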

 

 

WCT user manuals

Before using the tool, we recommend reading the manual, which you can download from GitHub: https://github.com/DIA-NZ/webcurator/wiki/Documentation

 
