ArXiv Papers with Open Source Code

Information Retrieval & Database Research Papers

Last updated: 2025-07-16 01:56 UTC

50

Total Papers

50

With GitHub

14

Last 7 Days

45

Information Retrieval

5

Databases

Non-parametric Graph Convolution for Re-ranking in Recommendation Systems

Zhongyu Ouyang, Mingxuan Ju, Soroush Vosoughi, Yanfang Ye 2025-07-14 Information Retrieval

Graph knowledge has been proven effective in enhancing item rankings in recommender systems (RecSys), particularly during the retrieval stage. However, its application in the ranking stage, especially when richer contextual information in user-item interactions is available, remains underexplored. A...

Generative Cognitive Diagnosis

Jiatong Li, Qi Liu, Mengxiao Zhu 2025-07-13 Information Retrieval

Cognitive diagnosis (CD) models latent cognitive states of human learners by analyzing their response patterns on diagnostic tests, serving as a crucial machine learning technique for educational assessment and evaluation. Traditional cognitive diagnosis models typically follow a transductive predic...

Ambiguity-Aware and High-Order Relation Learning for Multi-Grained Image-Text Matching

Junyu Chen, Yihua Gao, Mingyuan Ge, Mingyong Li 2025-07-12 Information Retrieval

Image-text matching is crucial for bridging the semantic gap between computer vision and natural language processing. However, existing methods still face challenges in handling high-order associations and semantic ambiguities among similar instances. These ambiguities arise from subtle differences ...

DS@GT at Touché: Large Language Models for Retrieval-Augmented Debate

Anthony Miyaguchi, Conor Johnston, Aaryan Potdar 2025-07-12 Information Retrieval

Large Language Models (LLMs) demonstrate strong conversational abilities. In this Working Paper, we study them in the context of debating in two ways: their ability to perform in a structured debate along with a dataset of arguments to use and their ability to evaluate utterances throughout the deba...

DS@GT at LongEval: Evaluating Temporal Performance in Web Search Systems and Topics with Two-Stage Retrieval

Anthony Miyaguchi, Imran Afrulbasha, Aleksandar Pramov 2025-07-11 Information Retrieval

Information Retrieval (IR) models are often trained on static datasets, making them vulnerable to performance degradation as web content evolves. The DS@GT competition team participated in the Longitudinal Evaluation of Model Performance (LongEval) lab at CLEF 2025, which evaluates IR systems across...

Transfer Learning and Mixup for Fine-Grained Few-Shot Fungi Classification

Jason Kahei Tam, Murilo Gustineli, Anthony Miyaguchi 2025-07-11 Information Retrieval

Accurate identification of fungi species presents a unique challenge in computer vision due to fine-grained inter-species variation and high intra-species variation. This paper presents our approach for the FungiCLEF 2025 competition, which focuses on few-shot fine-grained visual categorization (FGV...

DTECT: Dynamic Topic Explorer & Context Tracker

Suman Adhya, Debarshi Kumar Sanyal 2025-07-10 Information Retrieval

The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We intr...

DTECT: Dynamic Topic Explorer & Context Tracker

Suman Adhya, Debarshi Kumar Sanyal 2025-07-10 Information Retrieval

The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We intr...

Boosting Parameter Efficiency in LLM-Based Recommendation through Sophisticated Pruning

Shanle Zheng, Keqin Bao, Jizhi Zhang, Yang Zhang, Fuli Feng, Xiangnan He 2025-07-09 Information Retrieval

LLM-based recommender systems have made significant progress; however, the deployment cost associated with the large parameter volume of LLMs still hinders their real-world applications. This work explores parameter pruning to improve parameter efficiency while maintaining recommendation quality, th...

CDC: Causal Domain Clustering for Multi-Domain Recommendation

Huishi Luo, Yiqing Wu, Yiwen Chen, Fuzhen Zhuang, Deqing Wang 2025-07-09 Information Retrieval

Multi-domain recommendation leverages domain-general knowledge to improve recommendations across several domains. However, as platforms expand to dozens or hundreds of scenarios, training all domains in a unified model leads to performance degradation due to significant inter-domain differences. Exi...

Shifting from Ranking to Set Selection for Retrieval Augmented Generation

Dahyun Lee, Yongrae Jo, Haeju Park, Moontae Lee 2025-07-09 Information Retrieval

Retrieval in Retrieval-Augmented Generation(RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs...

Temporal Information Retrieval via Time-Specifier Model Merging

SeungYoon Han, Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, Huije Lee, Jong C. Park 2025-07-09 Information Retrieval

The rapid expansion of digital information and knowledge across structured and unstructured sources has heightened the importance of Information Retrieval (IR). While dense retrieval methods have substantially improved semantic matching for general queries, they consistently underperform on queries ...

MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval

Naoya Sogi, Takashi Shibata, Makoto Terao, Masanori Suganuma, Takayuki Okatani 2025-07-09 Information Retrieval

Result diversification (RD) is a crucial technique in Text-to-Image Retrieval for enhancing the efficiency of a practical application. Conventional methods focus solely on increasing the diversity metric of image appearances. However, the diversity metric and its desired value vary depending on the ...

DS@GT at CheckThat! 2025: Exploring Retrieval and Reranking Pipelines for Scientific Claim Source Retrieval on Social Media Discourse

Jeanette Schofield, Shuyu Tian, Hoang Thanh Thanh Truong, Maximilian Heil 2025-07-09 Information Retrieval

Social media users often make scientific claims without citing where these claims come from, generating a need to verify these claims. This paper details work done by the DS@GT team for CLEF 2025 CheckThat! Lab Task 4b Scientific Claim Source Retrieval which seeks to find relevant scientific papers ...

Unconditional Diffusion for Generative Sequential Recommendation

Yimeng Bai, Yang Zhang, Sihao Ding, Shaohui Ruan, Han Yao, Danhui Guan, Fuli Feng, Tat-Seng Chua 2025-07-08 Information Retrieval

Diffusion models, known for their generative ability to simulate data creation through noise-adding and denoising processes, have emerged as a promising approach for building generative recommenders. To incorporate user history for personalization, existing methods typically adopt a conditional diff...

Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification

Murilo Gustineli, Anthony Miyaguchi, Adrian Cheung, Divyansh Khattak 2025-07-08 Information Retrieval

We describe DS@GT's second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer ViTD2PC24All for patch-level inference, (ii) a 4x4 tiling strategy that aligns patch size with the n...

When Transformers Meet Recommenders: Integrating Self-Attentive Sequential Recommendation with Fine-Tuned LLMs

Kechen Liu 2025-07-08 Information Retrieval

Self-Attentive Sequential Recommendation (SASRec) effectively captures long-term user preferences by applying attention mechanisms to historical interactions. Concurrently, the rise of Large Language Models (LLMs) has motivated research into LLM-based recommendation, which leverages their powerful g...

From ID-based to ID-free: Rethinking ID Effectiveness in Multimodal Collaborative Filtering Recommendation

Guohao Li, Li Jing, Jia Wu, Xuefei Li, Kai Zhu, Yue He 2025-07-08 Information Retrieval

Most existing multimodal collaborative filtering recommendation (MCFRec) methods rely heavily on ID features and multimodal content to enhance recommendation performance. However, this paper reveals that ID features are effective but have limited benefits in multimodal collaborative filtering recomm...

FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation

Maolin Wang, Yutian Xiao, Binhao Wang, Sheng Zhang, Shanshan Ye, Wanyu Wang, Hongzhi Yin, Ruocheng Guo, Zenglin Xu 2025-07-07 Information Retrieval

Modern recommendation systems face significant challenges in processing multimodal sequential data, particularly in temporal dynamics modeling and information flow coordination. Traditional approaches struggle with distribution discrepancies between heterogeneous features and noise interference in m...

Hierarchical Intent-guided Optimization with Pluggable LLM-Driven Semantics for Session-based Recommendation

Jinpeng Chen, Jianxiang He, Huan Li, Senzhang Wang, Yuan Cao, Kaimin Wei, Zhenye Yang, Ye Ji 2025-07-07 Information Retrieval

Session-based Recommendation (SBR) aims to predict the next item a user will likely engage with, using their interaction sequence within an anonymous session. Existing SBR models often focus only on single-session information, ignoring inter-session relationships and valuable cross-session insights....

Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search

Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yang Zhao, Hongjin Qian, Zhicheng Dou 2025-07-03 Information Retrieval

Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: th...

Listwise Preference Alignment Optimization for Tail Item Recommendation

Zihao Li, Chao Yang, Tong Zhang, Yakun Chen, Xianzhi Wang, Guandong Xu, Daoyi Dong 2025-07-03 Information Retrieval

Preference alignment has achieved greater success on Large Language Models (LLMs) and drawn broad interest in recommendation research. Existing preference alignment methods for recommendation either require explicit reward modeling or only support pairwise preference comparison. The former directly ...

Confidence and Stability of Global and Pairwise Scores in NLP Evaluation

Georgii Levtsov, Dmitry Ustalov 2025-07-02 Information Retrieval

With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This pape...

Uncertainty-Aware Complex Scientific Table Data Extraction

Kehinde Ajayi, Yi He, Jian Wu 2025-07-02 Information Retrieval

Table structure recognition (TSR) and optical character recognition (OCR) play crucial roles in extracting structured data from tables in scientific documents. However, existing extraction frameworks built on top of TSR and OCR methods often fail to quantify the uncertainties of extracted results. T...

PDFMathTranslate: Scientific Document Translation Preserving Layouts

Rongxin Ouyang, Chang Chu, Zhikuang Xin, Xiangyao Ma 2025-07-02 Information Retrieval

Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world's first open-source software for...

Uncertainty-Aware Complex Scientific Table Data Extraction

Kehinde Ajayi, Yi He, Jian Wu 2025-07-02 Information Retrieval

Table structure recognition (TSR) and optical character recognition (OCR) play crucial roles in extracting structured data from tables in scientific documents. However, existing extraction frameworks built on top of TSR and OCR methods often fail to quantify the uncertainties of extracted results. T...

MassTool: A Multi-Task Search-Based Tool Retrieval Framework for Large Language Models

Jianghao Lin, Xinyuan Wang, Xinyi Dai, Menghui Zhu, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang 2025-07-01 Information Retrieval

Tool retrieval is a critical component in enabling large language models (LLMs) to interact effectively with external tools. It aims to precisely filter the massive tools into a small set of candidates for the downstream tool-augmented LLMs. However, most existing approaches primarily focus on optim...

Why Multi-Interest Fairness Matters: Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System

Yongsen Zheng, Zongxuan Xie, Guohua Wang, Ziyao Liu, Liang Lin, Kwok-Yan Lam 2025-07-01 Information Retrieval

Unfairness is a well-known challenge in Recommender Systems (RSs), often resulting in biased outcomes that disadvantage users or items based on attributes such as gender, race, age, or popularity. Although some approaches have started to improve fairness recommendation in offline or static contexts,...

Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

Haocheng Yu, Yaxiong Wu, Hao Wang, Wei Guo, Yong Liu, Yawen Li, Yuyang Ye, Junping Du, Enhong Chen 2025-06-30 Information Retrieval

Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effec...

JointRank: Rank Large Set with Single Pass

Evgeny Dedov 2025-06-27 Information Retrieval

Efficiently ranking relevant items from large candidate pools is a cornerstone of modern information retrieval systems -- such as web search, recommendation, and retrieval-augmented generation. Listwise rerankers, which improve relevance by jointly considering multiple candidates, are often limited ...

skLEP: A Slovak General Language Understanding Benchmark

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko 2025-06-26 Information Retrieval

In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a ...

Small Encoders Can Rival Large Decoders in Detecting Groundedness

Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar 2025-06-26 Information Retrieval

Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal know...

Response Quality Assessment for Retrieval-Augmented Generation via Conditional Conformal Factuality

Naihe Feng, Yi Sui, Shiyi Hou, Jesse C. Cresswell, Ga Wu 2025-06-26 Information Retrieval

Existing research on Retrieval-Augmented Generation (RAG) primarily focuses on improving overall question-answering accuracy, often overlooking the quality of sub-claims within generated responses. Recent methods that attempt to improve RAG trustworthiness, such as through auto-evaluation metrics, l...

EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora

Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang, Xiaofang Zhou 2025-06-26 Information Retrieval

Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large language models (LLMs) by structuring retrieval over an external corpus. However, existing approaches typically assume a static corpus, requiring expensive full-graph reconstruction whenever new documents arrive, limiting their sc...

EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora

Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang, Xiaofang Zhou 2025-06-26 Information Retrieval

Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large language models (LLMs) by structuring retrieval over an external corpus. However, existing approaches typically assume a static corpus, requiring expensive full-graph reconstruction whenever new documents arrive, limiting their sc...

RAG-VisualRec: An Open Resource for Vision- and Text-Enhanced Retrieval-Augmented Generation in Recommendation

Ali Tourani, Fatemeh Nazary, Yashar Deldjoo 2025-06-25 Information Retrieval

This paper addresses the challenge of developing multimodal recommender systems for the movie domain, where limited metadata (e.g., title, genre) often hinders the generation of robust recommendations. We introduce a resource that combines LLM-generated plot descriptions with trailer-derived visual ...

ReCode: Updating Code API Knowledge with Reinforcement Learning

Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang 2025-06-25 Information Retrieval

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes ...

CoVE: Compressed Vocabulary Expansion Makes Better LLM-based Recommender Systems

Haochen Zhang, Tianyi Zhang, Junze Yin, Oren Gal, Anshumali Shrivastava, Vladimir Braverman 2025-06-24 Information Retrieval

Recommender systems play a pivotal role in providing relevant content to users. With the rapid development of large language models (LLMs), researchers have begun utilizing LLMs to build more powerful recommender systems. However, existing approaches that focus on aligning LLMs with recommendation t...

From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu 2025-06-23 Information Retrieval

Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language ...

Harnessing the Power of Reinforcement Learning for Language-Model-Based Information Retriever via Query-Document Co-Augmentation

Jingming Liu, Yumeng Li, Wei Shi, Yao-Xiang Ding, Hui Su, Kun Zhou 2025-06-23 Information Retrieval

Recent studies have proposed leveraging Large Language Models (LLMs) as information retrievers through query rewriting. However, for challenging corpora, we argue that enhancing queries alone is insufficient for robust semantic matching; the LLM should also have sufficient understanding of the corpu...

From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu 2025-06-23 Information Retrieval

Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language ...

Hierarchical Patch Compression for ColPali: Efficient Multi-Vector Document Retrieval with Dynamic Pruning and Quantization

Duong Bach 2025-06-19 Information Retrieval

Multi-vector document retrieval systems, such as ColPali, excel in fine-grained matching for complex queries but incur significant storage and computational costs due to their reliance on high-dimensional patch embeddings and late-interaction scoring. To address these challenges, we propose HPC-ColP...

DiscRec: Disentangled Semantic-Collaborative Modeling for Generative Recommendation

Chang Liu, Yimeng Bai, Xiaoyan Zhao, Yang Zhang, Fuli Feng, Wenge Rong 2025-06-18 Information Retrieval

Generative recommendation is emerging as a powerful paradigm that directly generates item predictions, moving beyond traditional matching-based approaches. However, current methods face two key challenges: token-item misalignment, where uniform token-level modeling ignores item-level granularity tha...

Multi-Interest Recommendation: A Survey

Zihao Li, Qiang Chen, Lixin Zou, Aixin Sun, Chenliang Li 2025-06-18 Information Retrieval

Existing recommendation methods often struggle to model users' multifaceted preferences due to the diversity and volatility of user behavior, as well as the inherent uncertainty and ambiguity of item attributes in practical scenarios. Multi-interest recommendation addresses this challenge by extract...

Tree-Based Text Retrieval via Hierarchical Clustering in RAGFrameworks: Application on Taiwanese Regulations

Chia-Heng Yu, Yen-Lung Tsai 2025-06-16 Information Retrieval

Traditional Retrieval-Augmented Generation (RAG) systems employ brute-force inner product search to retrieve the top-k most similar documents, then combined with the user query and passed to a language model. This allows the model to access external knowledge and reduce hallucinations. However, sele...

Schema-R1: A reasoning training approach for schema linking in Text-to-SQL Task

Wuzhenghong Wen, Su Pan, yuwei Sun 2025-06-13 Databases

Schema linking is a critical step in Text-to-SQL task, aiming to accurately predict the table names and column names required for the SQL query based on the given question. However, current fine-tuning approaches for schema linking models employ a rote-learning paradigm, excessively optimizing for g...

Bridging RDF Knowledge Graphs with Graph Neural Networks for Semantically-Rich Recommender Systems

Michael Färber, David Lamprecht, Yuni Susanti 2025-06-10 Databases

Graph Neural Networks (GNNs) have substantially advanced the field of recommender systems. However, despite the creation of more than a thousand knowledge graphs (KGs) under the W3C standard RDF, their rich semantic information has not yet been fully leveraged in GNN-based recommender systems. To ad...

KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Sivaprasad Sudhir, Om Chabra, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska 2025-06-06 Databases

Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain knowledge, technical exp...

MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish 2025-06-05 Databases

Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have ...

VecFlow: A High-Performance Vector Data Management System for Filtered-Search on GPUs

Jingyi Xi, Chenghao Mo, Benjamin Karsin, Artem Chirkin, Mingqin Li, Minjia Zhang 2025-06-01 Databases

Vector search and database systems have become a keystone component in many AI applications. While many prior research has investigated how to accelerate the performance of generic vector search, emerging AI applications require running more sophisticated vector queries efficiently, such as vector s...