Title:
Extracting A Large Corpus from the Internet Archive, A Case Study.
Authors:
Weig, Eric C. (AUTHOR) eweig@uky.edu
Source:
Code4Lib Journal. 2025, Issue 61, pN.PAG-N.PAG. 1p.
Database:
Library, Information Science & Technology Abstracts

*The Internet Archive was founded on May 10, 1996, in San Francisco, CA. Since its inception, the archive has amassed an enormous corpus of content, including over 866 billion web pages, more than 42.5 million print materials, 13 million videos, and 14 million audio files. Uploading content to the Internet Archive is relatively easy, as is downloading individual objects by visiting their pages and clicking specific links. Downloading a large collection of thousands or even tens of thousands of items, however, is not as easy. This article outlines how the University of Kentucky Libraries downloaded over 86,000 previously uploaded newspaper issues from the Internet Archive for local use, using ChatGPT to generate Python scripts that accessed the Internet Archive via its API (Application Programming Interface). [ABSTRACT FROM AUTHOR]*
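The abstract does not reproduce the scripts themselves, so the following is only a rough sketch of the kind of bulk-download workflow it describes, assuming the Internet Archive's public `advancedsearch.php` endpoint and its `archive.org/download/` path; the collection and item names shown are hypothetical placeholders, not ones from the article.

```python
from urllib.parse import urlencode

def build_search_url(collection: str, rows: int = 100, page: int = 1) -> str:
    """Build a query URL for the Internet Archive advanced-search endpoint.

    The endpoint returns JSON containing, among other fields, the identifier
    of every item in the collection; a bulk-download script would page
    through these results to enumerate items.
    """
    params = {
        "q": f"collection:{collection}",  # hypothetical collection name goes here
        "fl[]": "identifier",             # only fetch the item identifiers
        "rows": rows,
        "page": page,
        "output": "json",
    }
    return "https://archive.org/advancedsearch.php?" + urlencode(params)

def build_download_url(identifier: str, filename: str) -> str:
    """Build the direct download URL for one file inside an Internet Archive item."""
    return f"https://archive.org/download/{identifier}/{filename}"

# Example (placeholder names): a script would fetch the search URL, read the
# identifiers from the JSON response, then fetch each item's files in turn.
search_url = build_search_url("example-newspaper-collection")
item_url = build_download_url("example-item-001", "issue.pdf")
```

A fuller implementation could instead use the official `internetarchive` Python library, which wraps the same API with `search_items()` and `download()` helpers; the sketch above sticks to plain URL construction to keep the mechanics visible.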