Saturday, January 25, 2020

Tutorial: Back Up a Web Page or Web Site

Let's say there is a website you visit on a regular basis that contains useful information you do not want to lose access to. You could lose access for a variety of reasons, from the site owner deciding not to pay for hosting any more to a company deciding to replace the site with one that is supposed to be improved, but is instead more complicated.

Backing Up a Webpage for Yourself and Others with the Internet Archive Wayback Machine

The Internet Archive is a nonprofit organization dedicated to preserving digital data. One of their projects, called the Wayback Machine, keeps copies of webpages dating all the way back to 1996, and is considered trusted for citing a web page as it existed at a particular moment in time.

From the Wayback Machine homepage, you can use the large text box near the top center of the page to search for sites and pages that have already been archived. Entering a specific URL will open a calendar showing all of the times that webpage has been archived, and you can click on one of those dates to view the webpage as it existed at that moment in time. You can also prepend https://web.archive.org/*/ to a URL to go directly to the list of archived versions of that page.
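The URL prefix trick can be scripted if you look up archived copies often. The following is a minimal sketch, not an official tool: the function name wayback_calendar_url is made up for this example, and https://example.com/ is a placeholder address.

```shell
# Hypothetical helper: build the Wayback Machine calendar URL for a page
# by putting the https://web.archive.org/*/ prefix in front of its address.
wayback_calendar_url() {
    printf 'https://web.archive.org/*/%s\n' "$1"
}

# Placeholder address; replace it with the page you want to look up.
wayback_calendar_url "https://example.com/"
```

Paste the printed address into your browser to open the calendar of archived versions.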

To back up a webpage and add it to the list of archived versions on the calendar page, you can use the small text box in the bottom right portion of the Wayback Machine homepage, or go directly to the Wayback Machine's Save Page Now tool. On that page, you can optionally check the "Save outlinks" box to also save the pages that are linked from the page URL you have entered. Once you click the "Save Page" button, the page you have specified will immediately and permanently be saved to the Wayback Machine for yourself and others to access.
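If you prefer the command line, a save can usually also be requested by fetching a URL. The sketch below rests on an assumption that is not stated above: that requesting https://web.archive.org/save/ followed by the page address triggers the same capture as the "Save Page" button. The function names are made up for this example, and the service may rate-limit or reject requests.

```shell
# Hypothetical helper: build the Save Page Now address for a page.
# Assumption: the /save/ prefix triggers a capture of the given URL.
save_page_now_url() {
    printf 'https://web.archive.org/save/%s\n' "$1"
}

# Hypothetical helper: request the capture with curl, discarding the response body.
# Defined here but not called, so nothing is submitted until you run it yourself.
save_page_now() {
    curl --silent --show-error --location --output /dev/null "$(save_page_now_url "$1")"
}

# Print the address that would be requested for a placeholder page.
save_page_now_url "https://example.com/"
```

To actually submit a page, run save_page_now with the page address, then check the calendar page to confirm the capture appears.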

Note: While some sites do block the Wayback Machine from displaying archived pages from their site, most do not.

Backing Up an Entire Website for Yourself and Others to the Internet Archive Wayback Machine

If you are aware of an important site that is disappearing soon and would like to save it so the public can access it in the future, you can contact Archive Team on IRC by joining the #archiveteam channel on the HackInt IRC network. They have special tools that can download entire sites and then upload them to be accessible in the Wayback Machine.

Backing Up an Entire Website for Yourself Using WGET

If you want to back up a copy of a website for personal archiving or offline access purposes, you can use an open-source command-line utility called WGET. Most Linux distributions have it installed by default, and it is also available for macOS, Windows, and other platforms.

When you use this method, you are downloading a copy of a web site to your hard drive, and therefore will need enough disk space to store your copy of the website. In addition, your archive might not be considered a trusted citation by others in the future because it is easy to modify the content of the website you downloaded.

To start, open a command line in the folder where you want to download your site. Then, type the following command, replacing the bracketed text with the address of the site's homepage:

wget --recursive --page-requisites --convert-links [website homepage address]

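The --recursive switch tells WGET to search for links within a page, download the linked pages, then search those pages and download their links in turn. By default, WGET follows links up to five levels deep and stays on the starting site's host; add --level=inf if you want it to continue until every linked-to page on that site has been downloaded.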

The --page-requisites switch tells WGET to download any style, image, script, or other files that are needed to correctly display a webpage.

The --convert-links switch tells WGET to adjust the links in the pages after they are downloaded so they will work on your computer. Links to pages on the website are replaced with relative links on your computer.
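To make the three switches concrete, here is a minimal sketch that wraps the command in a small shell function. The function name backup_site and its optional "print" argument are made up for this example, and https://example.com/ is a placeholder address.

```shell
# Hypothetical helper: back up one site into the current folder.
# Usage: backup_site <homepage address> [print]
# With "print" as the second argument, the command is shown instead of run.
backup_site() {
    site_url="$1"
    mode="${2:-run}"
    # Rebuild the argument list as the full wget command line.
    set -- wget --recursive --page-requisites --convert-links "$site_url"
    if [ "$mode" = "print" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show the command for a placeholder site without downloading anything.
backup_site "https://example.com/" print
```

Once the printed command looks right, call the function again without "print" to start the download. WGET saves the files in a folder named after the site's host inside the current folder.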

Additional useful command switches:
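--no-check-certificate skips validation of the site's HTTPS certificate. It is useful when your certificate trust store has problems or a site does not have a valid HTTPS configuration any more, but it also means WGET can no longer confirm that it is talking to the real server.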
--no-parent prevents pages that are above your starting page's directory from being downloaded. Useful if the site you want to download shares a domain with other sites.
--timestamping allows you to update a previously-downloaded site by downloading only the pages on the server that are newer than the pages on your computer.
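--mirror is shorthand for a set of options suited to mirroring: it turns on --recursive and --timestamping, removes the recursion depth limit, and keeps FTP directory listings.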
--span-hosts allows links to other sites to also be recursively downloaded. Be careful with this option, as you might end up downloading a large portion of the internet!

You can run wget --help for a full list of command switches.
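As one way of combining these switches, the sketch below prints a command for keeping a copy of a single section of a site up to date. It is an illustration under stated assumptions rather than a recommended recipe: the function name mirror_command is made up for this example, and https://example.com/docs/ is a placeholder for a section that shares its domain with other content.

```shell
# Hypothetical helper: print a wget command that mirrors one section of a site.
# --mirror already includes --recursive and --timestamping, so running the same
# command again later only fetches pages that changed on the server.
# --no-parent keeps wget from climbing above the starting directory.
mirror_command() {
    printf 'wget --mirror --no-parent --page-requisites --convert-links %s\n' "$1"
}

# Placeholder address; replace it with the section you want to keep.
mirror_command "https://example.com/docs/"
```

Copy the printed command into the folder that holds your backup and run it there. Running it from the same folder each time lets --timestamping compare the server's pages against the copies you already have.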

Conclusion

There are many easy ways to save a copy of a webpage or website for the future. If you want to save a single webpage and optionally the webpages that are linked in that webpage, save it to the Internet Archive Wayback Machine. If there is an important website that may disappear soon, notify Archive Team. If you want a copy of a website for yourself, use WGET. The important thing is that you remember to archive the information before it is gone forever!
