Saptarshi Sengupta

PhD Student in Informatics

I'm a PhD student at The Pennsylvania State University working on domain-specific applications of LLMs, under the guidance of my advisor, Dr. Suhang Wang. Outside of work, I am an advocate for animal rights 🦙 and enjoy reading 📖 (currently an Alan Turing biography), cooking 🍲, and playing the guitar 🎸.

Email GitHub Google Scholar LinkedIn X (Twitter)

Research Interests

My research interests span various aspects of NLP, including QA, RAG, IR/Search, LLM agents, model interpretability, and low-resource languages. Overall, I'm interested in applying language technologies to challenging edge cases that have either too much or too little data. Through my work, I aim to develop methods for tackling real-world problems that are easy to use and cost-effective. You can find my research timeline in this Google Slides presentation.

Publications

Pre-Prints

BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions

Saptarshi Sengupta, Shuhua Yang, Paul Kwong Yu, Fali Wang, Suhang Wang

arXiv Code

ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers (Just accepted to EACL 2026 Main!)

Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang, Suhang Wang, Zhe Feng

arXiv Code (waiting for legal clearance to release code)

MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification

Saptarshi Sengupta, Harsh Vashistha, Kristal Curtis, Akshay Mallipeddi, Abhinav Mathur, Joseph Ross, Liang Gou

arXiv

Published Work

TOP-Training: Target-Oriented Pretraining for Medical Extractive Question Answering

Saptarshi Sengupta, Connor Heaton, Shreya Ghosh, Wenpeng Yin, Preslav Nakov, Suhang Wang

International Conference on Computational Linguistics (COLING), 2025

Paper Code

Exploring Language Model Generalization in Low-Resource Extractive QA

Saptarshi Sengupta, Wenpeng Yin, Preslav Nakov, Shreya Ghosh, Suhang Wang

International Conference on Computational Linguistics (COLING), 2025

Paper Code

Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings

Saptarshi Sengupta, Connor Heaton, Suhan Cui, Soumalya Sarkar, Prasenjit Mitra

IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2024

Paper Code

Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language

Rajat Pandit, Saptarshi Sengupta, Sudip Kumar Naskar, Niladri Sekhar Dash, Mohini Mohan Sardar

Informatics journal, 2019

Paper

Word sense induction in Bengali using parallel corpora and distributional semantics

Saptarshi Sengupta, Rajat Pandit, Parag Mitra, Sudip Kumar Naskar, Mohini Mohan Sardar

Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 2019

Paper

Writing

Feed-Forward Neural Network From Scratch

I've always wanted to implement a simple FFNN from scratch to see how the math works and understand it at a deeper level. This is my attempt at an educational walkthrough that breaks the math down into small pieces to make it more accessible. Note: All of the code works, but some final illustrations are still to be added.

Notebook

Experience

NLP and Large Language Model Intern

Robert Bosch LLC | May 2025 - August 2025

Researched tool retrieval for LLM agents that must choose from a large number of tools. Proposed a new framework, ToolDreamer, for this task.

Machine Learning Applied Scientist Intern

Splunk | May 2024 - November 2024

Worked on synthetic data generation and LLM-agent trajectory verification for an internal AI assistant. The systems I developed were implemented using the AutoGen library.

Last updated: October 2025