ETD24 Workshop2

Globalizing Knowledge: Leveraging Large Language Models to Enhance Accessibility of ETDs

Yinlin Chen1

William A Ingram1

Edward A Fox2

1University Libraries, 2Department of Computer Science
Virginia Tech, Blacksburg, VA, USA

Electronic Theses and Dissertations (ETDs) encapsulate significant research findings and innovative ideas but often have limited visibility and accessibility, particularly in regions and disciplines with restricted digital reach. This workshop introduces an LLM-based application that uses a Retrieval-Augmented Generation (RAG) architecture to address these challenges. The application uses LLMs to translate and standardize ETD metadata and content into a user’s native language and stores the results in a unified vector database, which serves as the knowledge source for retrieving relevant information. The retrieved information is then supplied to the LLMs to generate comprehensive responses, so that search can be tailored to local or remote ETD collections. This approach improves the indexing and discoverability of ETDs and makes them accessible across linguistic boundaries. During the workshop, we will present the system’s components in detail, illustrating the program workflow and the interaction between the query, retrieval, and response generation phases. Participants will learn how to integrate these technologies into their digital library systems and repositories and how to adapt them to institutional needs to enhance the global visibility and utility of their ETD collections.

ETD 2024 Workshop 4

Workshop slides

Learning Objectives

By the end of this workshop, participants will be able to:

Understand the potential of Large Language Models (LLMs) in improving ETD accessibility and discoverability
Grasp the fundamentals of Retrieval-Augmented Generation (RAG) architecture
Learn how to implement LLM-based solutions for translating and standardizing ETD metadata and content
Develop skills to create and query unified vector databases for ETD collections
Gain practical experience in integrating LLM technologies into existing digital library systems

Target Audience

This workshop is designed for:

Librarians and information professionals managing ETD collections
Researchers and graduate students interested in enhancing the visibility of their work
Digital repository managers and developers
Academic institutions looking to improve their ETD accessibility globally

Workshop Format

The workshop will employ a mixed format, combining lectures, interactive discussions, and hands-on sessions:

Lecture segments
Hands-on practical sessions
Discussions

Technical Requirements

High-speed internet connection
Laptop (participants should bring their own)

Workshop Outline

1. Introduction and Overview

Welcome and introductions
Workshop objectives and agenda

2. Lecture: LLMs and RAG Architecture Basics

Introduction to Large Language Models
Overview of Retrieval-Augmented Generation (RAG)
Applications in ETD accessibility

3. Lecture: ETD Metadata Standardization and Translation

Challenges in ETD metadata consistency
LLM-based approaches to metadata standardization
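
As a rough illustration of what LLM-based metadata standardization might look like, the sketch below builds a prompt that asks a model to map a heterogeneous record onto a fixed set of fields, then validates the reply. It is a minimal sketch under stated assumptions, not the workshop's implementation: the target field list, the function names, the sample record, and the canned model reply are all hypothetical, and no real LLM is called.

```python
import json

# Hypothetical target schema; a real repository would use its own field set.
TARGET_FIELDS = ["title", "author", "degree", "year", "language"]


def build_standardization_prompt(raw_record):
    """Build a prompt asking a model to map a raw record onto TARGET_FIELDS."""
    return (
        "Map this ETD metadata record to a JSON object with exactly these keys: "
        + ", ".join(TARGET_FIELDS)
        + ". Use null for any value that is missing.\n"
        + json.dumps(raw_record, ensure_ascii=False)
    )


def validate_response(response_text):
    """Parse the model's reply, keep only expected keys, and normalize the year."""
    data = json.loads(response_text)
    record = {field: data.get(field) for field in TARGET_FIELDS}
    if record["year"] is not None:
        record["year"] = int(record["year"])
    return record


# Invented sample record with non-English field names.
raw = {"Titulo": "Aprendizaje profundo", "Autor": "A. Pérez", "Fecha": "2021-05-10"}
prompt = build_standardization_prompt(raw)

# Canned reply standing in for a model's output (assumption: the model returns JSON).
fake_reply = (
    '{"title": "Aprendizaje profundo", "author": "A. Pérez", "degree": null, '
    '"year": "2021", "language": "es", "extra": "ignored"}'
)
standardized = validate_response(fake_reply)
print(standardized)
```

Validating the reply against a fixed field list, rather than trusting it as-is, is one way to keep unexpected keys or types out of the downstream index.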

4. Break (15 minutes)

5. Hands-on Session: Implementing LLM Solutions

Setting up the development environment
Building a simple RAG system for ETDs
Creating and querying vector databases
Integrating translation capabilities
Testing and evaluating the system
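
The steps listed for this session can be sketched end to end in a few lines of standard-library Python. This is a toy sketch under loud assumptions, not the system presented in the workshop: the records are invented, translation is a lookup table standing in for an LLM call, the "embedding" is a bag-of-words count rather than a neural model, the "vector database" is an in-memory list, and the final generation step is left as a prompt string that would be passed to an LLM.

```python
import math
import re
from collections import Counter

# Invented sample records for illustration only.
ETDS = [
    {"id": "etd-1", "text": "Deep learning methods for flood prediction in river basins"},
    {"id": "etd-2", "text": "Aprendizaje profundo para la predicción de inundaciones"},
    {"id": "etd-3", "text": "Soil nutrient management for maize farming"},
]

# Stand-in for an LLM translation step (hypothetical lookup table).
TRANSLATIONS = {"etd-2": "Deep learning for flood prediction"}


def translate_to_english(record):
    """Return a translated text if one exists, else the original text."""
    return TRANSLATIONS.get(record["id"], record["text"])


def embed(text):
    """Toy embedding: lowercase word counts (a real system would use an embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(records):
    """Translate each record and store it with its vector in an in-memory list."""
    index = []
    for record in records:
        english = translate_to_english(record)
        index.append({"id": record["id"], "text": english, "vector": embed(english)})
    return index


def retrieve(index, query, k=2):
    """Return the k entries most similar to the query."""
    q = embed(query)
    return sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)[:k]


def build_prompt(query, hits):
    """Assemble the retrieved records into a prompt for the generation step."""
    context = "\n".join(f"[{h['id']}] {h['text']}" for h in hits)
    return f"Answer using only these ETD records:\n{context}\n\nQuestion: {query}"


query = "flood prediction with deep learning"
index = build_index(ETDS)
hits = retrieve(index, query, k=2)
prompt = build_prompt(query, hits)
print(prompt)
```

Because the Spanish-language record is translated before indexing, an English query can retrieve it alongside the English one, which is the cross-lingual effect the translation step is meant to provide.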

6. Discussion: Challenges and Opportunities

Group discussion on implementing LLM solutions
Identifying research directions and addressing challenges
Q&A session

Biography of Workshop Leaders

Yinlin Chen holds a Ph.D. in Computer Science and Applications from Virginia Tech, and an M.S. and a B.S. in Computer Science from National Tsing Hua University, Taiwan. He is an Assistant Director of the Center for Digital Research & Scholarship and an Assistant Professor at the Virginia Tech Libraries. His professional interests include Digital Libraries, Machine Learning, Artificial Intelligence, and Cloud Computing.

William A Ingram is an Associate Professor at Virginia Tech and serves as Associate Dean and Executive Director for Information Technologies in the University Libraries. He holds a B.A. in Cognitive Science from the University of Virginia and an M.S. in Library and Information Science from the University of Illinois at Urbana-Champaign. Ingram's research focuses on digital libraries and information retrieval, particularly applying machine learning and AI to improve access to digital collections. He is also instrumental in organizing workshops on AI for libraries and cultural heritage organizations, with an emphasis on ethics and bias mitigation.

Edward A Fox is a Professor of Computer Science at Virginia Tech, where he directs the Digital Library Research Laboratory. Since 1983 he has taught courses on digital libraries, information retrieval, multimedia/hypertext/information access, etc. He is a Fellow of ACM, IEEE, AIIA, and AAIA. His degrees are from MIT (BS) and Cornell University (MS, Ph.D.). He serves as Executive Director and Chairman of the Board of the Networked Digital Library of Theses and Dissertations (NDLTD). He collaborates with Yinlin Chen and William Ingram on IMLS grants related to the topics of this workshop.

This workshop is a collaboration between University Libraries and the Department of Computer Science, funded by the Institute of Museum and Library Services (IMLS).