Globalizing Knowledge: Leveraging Large Language Models to Enhance Accessibility of ETDs

Yinlin Chen1
William A Ingram1
Edward A Fox2

1University Libraries, 2Department of Computer Science
Virginia Tech, Blacksburg, VA, USA



Electronic Theses and Dissertations (ETDs) encapsulate significant research findings and innovative ideas but often have limited visibility and accessibility, particularly in regions and disciplines with restricted digital reach. This workshop introduces an LLM-based application that uses a Retrieval-Augmented Generation (RAG) architecture to address these challenges. The application uses LLMs to translate and standardize ETD metadata and content into a user's native language, then stores the results in a unified vector database that serves as the knowledge source for retrieving relevant information. The retrieved information is supplied to the LLMs to generate comprehensive responses, improving search over local or remote ETD collections. This approach improves the indexing and discoverability of ETDs and makes them accessible across linguistic boundaries. During the workshop, we will present the system's components and illustrate the program workflow and the interaction between the query, retrieval, and response generation phases. Participants will learn how to integrate these technologies into their digital library systems and repositories and adapt them to institutional needs to enhance the global visibility and utility of their ETD collections.



ETD 2024 Workshop 2

Learning Objectives

By the end of this workshop, participants will be able to:

  1. Understand the potential of Large Language Models (LLMs) in improving ETD accessibility and discoverability
  2. Grasp the fundamentals of Retrieval-Augmented Generation (RAG) architecture
  3. Implement LLM-based solutions for translating and standardizing ETD metadata and content
  4. Create and query unified vector databases for ETD collections
  5. Integrate LLM technologies into existing digital library systems

Target Audience

This workshop is designed for:

  • Librarians and information professionals managing ETD collections
  • Researchers and graduate students interested in enhancing the visibility of their work
  • Digital repository managers and developers
  • Academic institutions looking to improve their ETD accessibility globally

Workshop Format

The workshop will employ a mixed format, combining lectures, interactive discussions, and hands-on sessions:

  1. Lecture segments, including the introduction (90 minutes)
  2. Hands-on practical sessions (90 minutes)
  3. Discussions (30 minutes)

Technical Requirements

  • High-speed internet connection
  • Participants should bring laptops

Estimated Duration

Total duration: 4 hours

  • Introduction and overview (10 minutes)
  • Lecture: LLMs and RAG architecture basics (40 minutes)
  • Break (10 minutes)
  • Lecture: ETD metadata standardization and translation (40 minutes)
  • Break (10 minutes)
  • Hands-on session: Implementing LLM solutions (90 minutes)
  • Break (10 minutes)
  • Discussion: Challenges and opportunities (30 minutes)

Workshop Outline

1. Introduction and Overview (10 minutes)

  • Welcome and introductions
  • Workshop objectives and agenda

2. Lecture: LLMs and RAG Architecture Basics (40 minutes)

  • Introduction to Large Language Models
  • Overview of Retrieval-Augmented Generation (RAG)
  • Applications in ETD accessibility

3. Break (10 minutes)

4. Lecture: ETD Metadata Standardization and Translation (40 minutes)

  • Challenges in ETD metadata consistency
  • LLM-based approaches to metadata standardization
  • Multilingual translation of ETD content
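
As a rough illustration of the standardization and translation topics above, the sketch below builds an instruction prompt for an LLM and validates the JSON it is expected to return. It does not call any model. The target field list, the function names, and the sample record are all hypothetical choices made for this example; they are not the schema or code used in the workshop.

```python
import json

# Hypothetical target schema (an assumption for this sketch, not a prescribed standard).
TARGET_FIELDS = ["title", "author", "degree", "institution", "year", "language", "abstract"]


def build_standardization_prompt(record, target_language="English"):
    """Build an instruction asking an LLM to normalize and translate one ETD record."""
    fields = ", ".join(TARGET_FIELDS)
    return (
        f"Normalize the ETD metadata record below into a JSON object with exactly "
        f"these keys: {fields}. "
        f"Translate the title and abstract into {target_language}. "
        "Keep personal and institutional names in their original form. "
        "Use null for any value that is missing; do not guess.\n\n"
        f"Record:\n{json.dumps(record, ensure_ascii=False, indent=2)}"
    )


def parse_response(raw):
    """Check that the model's reply is a JSON object containing every target key."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in TARGET_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    # Drop any extra keys the model may have added.
    return {f: data[f] for f in TARGET_FIELDS}


# Invented sample record with non-standard, Spanish-language field names.
record = {"titulo": "Monitoreo de la calidad del agua", "autor": "A. Ejemplo", "fecha": "2019"}
prompt = build_standardization_prompt(record, target_language="English")
print(prompt)
```

Validating the reply before indexing it is one way to keep malformed or incomplete model output out of the shared collection; a production system would likely add checks on individual values as well.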

5. Break (10 minutes)

6. Hands-on Session: Implementing LLM Solutions (90 minutes)

  • Setting up the development environment
  • Building a simple RAG system for ETDs
  • Creating and querying vector databases
  • Integrating translation capabilities
  • Testing and evaluating the system
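
To make the retrieval half of a RAG system concrete, here is a minimal sketch under stated assumptions. It replaces a real embedding model and vector database with bag-of-words term counts and an in-memory list, and it stops at building the prompt that would be sent to an LLM. The records, identifiers, and function names are invented for illustration and are not the workshop's actual implementation.

```python
import math
import re
from collections import Counter

# Hypothetical ETD records; the texts are invented for this example.
ETDS = [
    {"id": "etd-1", "text": "Deep learning methods for classifying plant diseases from leaf images"},
    {"id": "etd-2", "text": "Water quality monitoring in rural river basins using low-cost sensors"},
    {"id": "etd-3", "text": "Neural machine translation for low-resource languages"},
]


def embed(text):
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(records):
    """Stand-in for a vector database: a list of (record, vector) pairs."""
    return [(r, embed(r["text"])) for r in records]


def retrieve(index, query, k=2):
    """Return the k records whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [record for record, _ in ranked[:k]]


def build_prompt(query, passages):
    """Assemble the retrieved passages and the question into one LLM prompt."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer the question using only the ETD excerpts below.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


index = build_index(ETDS)
query = "translation for low-resource languages"
hits = retrieve(index, query, k=1)
prompt = build_prompt(query, hits)
print(prompt)
```

Swapping `embed` for a multilingual embedding model and `build_index` for a vector database client would keep the same query, retrieval, and response generation flow.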

7. Break (10 minutes)

8. Discussion: Challenges and Opportunities (30 minutes)

  • Group discussion on implementing LLM solutions
  • Addressing potential ethical concerns
  • Q&A session

Biography of Workshop Leaders

Yinlin Chen holds a Ph.D. in Computer Science and Applications from Virginia Tech, and an M.S. and a B.S. in Computer Science from National Tsing Hua University, Taiwan. He is an Assistant Director of the Center for Digital Research & Scholarship and an Assistant Professor at the Virginia Tech Libraries. His professional interests include Digital Libraries, Machine Learning, Artificial Intelligence, and Cloud Computing.

William A Ingram is an Associate Professor at Virginia Tech and serves as Associate Dean and Executive Director for Information Technologies in the University Libraries. He holds a B.A. in Cognitive Science from the University of Virginia and an M.S. in Library and Information Science from the University of Illinois at Urbana-Champaign. Ingram's research focuses on digital libraries and information retrieval, particularly applying machine learning and AI to improve access to digital collections. He is also instrumental in organizing workshops on AI for libraries and cultural heritage organizations, with an emphasis on ethics and bias mitigation.

Edward A Fox is a Professor of Computer Science at Virginia Tech, where he directs the Digital Library Research Laboratory. Since 1983 he has taught courses on digital libraries, information retrieval, multimedia/hypertext/information access, etc. He is a Fellow of ACM, IEEE, AIIA, and AAIA. His degrees are from MIT (BS) and Cornell University (MS, Ph.D.). He serves as Executive Director and Chairman of the Board of the Networked Digital Library of Theses and Dissertations (NDLTD). He collaborates with Yinlin Chen and William Ingram on IMLS grants related to the topics of this workshop.



This workshop is a collaboration between University Libraries and the Department of Computer Science, funded by the Institute of Museum and Library Services (IMLS).