Title:
Extracting A Large Corpus from the Internet Archive, A Case Study.
Authors:
Weig, Eric C. (AUTHOR) eweig@uky.edu
Source:
Code4Lib Journal. 2025, Issue 61, pN.PAG-N.PAG. 1p.
Database:
Library, Information Science & Technology Abstracts

*The Internet Archive was founded on May 10, 1996, in San Francisco, CA. Since its inception, the archive has amassed an enormous corpus of content, including over 866 billion web pages, more than 42.5 million print materials, 13 million videos, and 14 million audio files. Uploading content to the Internet Archive is relatively easy, as is downloading individual objects by visiting their pages and clicking specific links. Downloading a large collection of thousands or even tens of thousands of items, however, is not as easy. This article outlines how the University of Kentucky Libraries downloaded over 86,000 previously uploaded newspaper issues from the Internet Archive for local use, using ChatGPT to generate Python scripts that accessed the Internet Archive via its API (Application Programming Interface). [ABSTRACT FROM AUTHOR]*
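The abstract does not reproduce the scripts themselves, so the following is only a rough sketch of the kind of bulk-download workflow it describes, assuming the Internet Archive's public `advancedsearch.php` endpoint and its `archive.org/download/` path; the collection and item names shown are hypothetical placeholders, not ones from the article.

```python
from urllib.parse import urlencode

def build_search_url(collection: str, rows: int = 100, page: int = 1) -> str:
    """Build a query URL for the Internet Archive advanced-search endpoint.

    The endpoint returns JSON containing, among other fields, the identifier
    of every item in the collection; a bulk-download script would page
    through these results to enumerate items.
    """
    params = {
        "q": f"collection:{collection}",  # hypothetical collection name goes here
        "fl[]": "identifier",             # only fetch the item identifiers
        "rows": rows,
        "page": page,
        "output": "json",
    }
    return "https://archive.org/advancedsearch.php?" + urlencode(params)

def build_download_url(identifier: str, filename: str) -> str:
    """Build the direct download URL for one file inside an Internet Archive item."""
    return f"https://archive.org/download/{identifier}/{filename}"

# Example (placeholder names): a script would fetch the search URL, read the
# identifiers from the JSON response, then fetch each item's files in turn.
search_url = build_search_url("example-newspaper-collection")
item_url = build_download_url("example-item-001", "issue.pdf")
```

A fuller implementation could instead use the official `internetarchive` Python library, which wraps the same API with `search_items()` and `download()` helpers; the sketch above sticks to plain URL construction to keep the mechanics visible.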