About Me

My name is Hongjun (Jan) Liu. I’m an undergraduate at Chu Kochen Honors College of Zhejiang University, pursuing dual bachelor’s degrees in Computer Science and Environmental Resource Management.

My research interests include Knowledge-Intensive Question Answering, Retrieval-Augmented LLMs, and NLP+Science.

I am very fortunate to be advised by Prof. Arman Cohan of Yale University & Allen Institute for AI (AI2). I am also advised by Prof. Chen Zhao of NYU Shanghai & New York University.

You can find my CV here: Hongjun’s Curriculum Vitae.

Email / GitHub / Twitter

My Research Experience

The following sections will introduce some of my explorations in scientific research so far. If there are details you find interesting that are not elaborated upon below, please do not hesitate to contact me.

New York University Research Assistant

Project: Developing and Enhancing Language Models for Improved Counterfactual Claim Processing and Evidence Retrieval

Advisor: Chen Zhao, Nov 2023 - Present

• Use a large language model (LLM) such as GPT-4 to transform counterfactual claims into retriever-friendly versions through decomposition, rephrasing, or similar methods, and then address the reformulated claims.

• Collect pairs of original and counterfactual claims to train an editing model, focusing on refining models such as Llama 2 for this purpose.

• Fine-tune a dense retriever so that it finds evidence more efficiently for claims edited by the LLM.

Yale University & New York University Research Assistant

Project: Extending Capabilities of Large Language Models for Knowledge-Intensive Financial Exam QA

Advisors: Arman Cohan & Chen Zhao, July 2023 - Present

• Construct a dataset for complex financial exam QA that combines textual and tabular content and requires college-level knowledge of the finance domain to solve.

• Provide expert-annotated, detailed solution references in Python program format for each QA in the dataset, ensuring a high-quality benchmark for LLM assessment.

• Evaluate a wide spectrum of LLMs on our newly constructed dataset with different prompting strategies such as Chain-of-Thought and Program-of-Thoughts. The current best-performing system (i.e., GPT-4 with Program-of-Thoughts) achieves only 45.4% accuracy, leaving substantial room for improvement.

• Build a knowledge retrieval module that uses the question as the retrieval query and retrieves the top-n knowledge terms with the highest similarity from our constructed knowledge bank, enhancing the ability of LLMs to solve knowledge-intensive hybrid QA.

• Improve the performance of language models such as Llama 2 on complex, knowledge-intensive hybrid question-answering tasks.
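
The top-n knowledge retrieval step described above can be illustrated with a minimal sketch. The similarity measure and retriever used in the project are not stated here, so this example assumes a simple bag-of-words cosine similarity; the function names and the three knowledge-bank entries are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_n(question, knowledge_bank, n=2):
    """Return the n knowledge terms most similar to the question.

    Assumption: each entry is scored on its term plus its definition,
    using bag-of-words cosine similarity as a stand-in for whatever
    similarity the real retrieval module uses.
    """
    q_vec = Counter(tokenize(question))
    scored = [
        (cosine(q_vec, Counter(tokenize(f"{term} {definition}"))), term)
        for term, definition in knowledge_bank.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored[:n]]


# Hypothetical knowledge bank, for illustration only.
KNOWLEDGE_BANK = {
    "current ratio": "current assets divided by current liabilities",
    "inventory turnover": "cost of goods sold divided by average inventory",
    "net present value": "sum of discounted future cash flows minus the initial investment",
}

QUESTION = "What is the current ratio given current assets and current liabilities?"
top_terms = retrieve_top_n(QUESTION, KNOWLEDGE_BANK, n=1)
```

In a setup like this, the retrieved terms and their definitions would be prepended to the prompt before the LLM answers the question.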

Alibaba DAMO Academy (Research Department) Internship

Project: Use Machine Learning to Establish a Mapping Relationship from Genetic Variation to Phenotypic Diversity

Advisor: Jieping Ye, Oct 2022 - June 2023

• Spearheaded the annotation of 3D scan data for the skulls of bird species, identifying key points and lines to delineate structural features. Conducted preprocessing of the point cloud data, involving detailed annotation of specific features and conversion of data formats for enhanced usability and analysis.

• Implemented a systematic approach for correlating genomic data with phenotypic traits, focusing on beak development genes and their association with beak width, length, and body mass.

• Applied statistical and machine learning methods commonly used in biology, including PCA for dimensionality reduction, ANOVA for inter-group differences, and various regression models (simple linear, polynomial, Lasso & Group Lasso) to uncover complex relationships.

• Conducted gene annotation and feature extraction from protein sequences, leveraging an array of trained models such as ProteinBert-Tape and MSA Transformer to extract significant information regarding protein structure and function.

• Collaborated in a multidisciplinary team to rank genes based on their correlation with phenotype data, integrating ecological measurement data for comprehensive analysis.

• Established a phenotypic prediction model that linked multiple genes to phenotypic traits and achieved 84% accuracy.

Zhejiang University Student Research Program

Project: Financial Statement Fraud and Valuation Misestimation Issues Detection

Advisor: Xili Zhang, Sept 2021 - Feb 2022

• Constructed a dataset based on the financial reports of companies over the past ten years.

• Used a chi-square test to identify five indicators (accounts receivable turnover, inventory turnover, cash ratio, intellectual property ratio, and current ratio) as significant factors for detecting companies that engage in financial statement fraud. These factors were then used to establish data classification labels.

• Evaluated various recognition models for value estimation on the restructured data, such as linear regression, logistic regression, decision trees, random forests, and neural networks.
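
As an illustration of the chi-square screening mentioned above, here is a minimal sketch of a Pearson chi-square test of independence on a 2x2 table. It assumes each indicator is binarized (for example, low vs. high current ratio) and compared against a fraud label; the counts are invented for the example and are not results from the project.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    No continuity correction is applied. Assumes every row and column
    total is non-zero. With 1 degree of freedom, the p-value equals
    erfc(sqrt(statistic / 2)).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows = low / high current ratio,
# columns = fraud / no fraud.
table = [[30, 70], [10, 90]]
statistic, p_value = chi_square_2x2(table)

# An indicator is kept as a candidate feature if the test is significant.
is_significant = p_value < 0.05
```

Repeating this test for each candidate indicator and keeping those below the chosen significance level is one simple way to arrive at a short list of factors like the five named above.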

Imperial College London Data Science Online Winter School 2022

Winter School, Jan 2022 - June 2022

• Acquired a fundamental understanding of machine learning and data science.

• Fine-tuned the model’s parameters to enable the use of machine learning for identifying brain tumor information.

• Standardized and cleaned the data to remove irrelevant information, normalize medical terminology, and handle abbreviations and synonyms.

• Validated the NLP strategies through clinical trials to ensure they are effective and safe.

For more info

Please feel free to contact me by email.