Zheng Zhao, Yftah Ziser, Shay Cohen
EMNLP 2024
This paper investigates how instruction-tuned LLMs internally process different tasks, finding that their layers organize into three functional groups: early layers for general features, middle layers for task-specific transitions, and final layers for refinement.
Pinzhen Chen, Zheng Zhao, Shun Shao
Arabic Natural Language Processing Conference (ArabicNLP) 2024
This paper describes Team Cher's submission to the ArabicNLP 2024 KSAA-CAD shared task on reverse dictionary for Arabic, which uses a multi-task learning framework that combines reverse dictionary, definition generation, and reconstruction tasks. The paper also examines various tokenization strategies and embedding architectures, and the system achieves strong results using only the provided training data.
Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo Ponti, Shay Cohen
NeurIPS 2024
We introduce Spectral Editing of Activations (SEA), a novel inference-time method to adjust large language models' internal representations, improving truthfulness and reducing bias. SEA projects input representations to align with positive examples while minimizing alignment with negatives, showing superior effectiveness, generalization, and efficiency compared to existing methods with minimal impact on other model capabilities.
Zheng Zhao, Emilio Monti, Jens Lehmann, Haytham Assem
NAACL 2024 Oral
This work introduces a novel approach that integrates contrastive decoding with adversarial irrelevant passages as negative samples to make generation more robustly grounded in context. It operates at inference time and requires no further training.
Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo Ponti, Shay Cohen
NAACL 2024
We assess whether Large Language Models (LLMs) like LLaMA 2 and GPT-4 have a coherent temporal understanding by evaluating their common-sense knowledge of event structure, ordering, and self-consistency. Our findings reveal that LLMs perform significantly worse than humans and specialized models, struggle with self-consistency, and show limited improvement with larger sizes or advanced techniques. We conclude that current LLMs lack a consistent temporal model, partly due to weak temporal information in their training data.
Manuj Malik*, Zheng Zhao*, Marcio Fonseca*, Shrisha Rao, Shay Cohen (* equal contribution)
SIGIR 2024
We present CivilSum, a dataset of 23,350 legal case decisions from Indian courts with human-written abstractive summaries, offering a challenging benchmark for legal summarization. Our analysis highlights that crucial content often appears at the end of documents, underscoring the need for effective long-sequence modeling in summarization.
Ashok Urlana*, Pinzhen Chen*, Zheng Zhao, Shay Cohen, Manish Shrivastava, Barry Haddow (* equal contribution)
EMNLP (Findings) 2023
We introduce PMIndiaSum, a massively parallel and multilingual summarization corpus for Indian languages, covering 14 languages and 196 language pairs. We detail our data construction process and provide benchmarks for various summarization methods, demonstrating the dataset's significant impact on Indian language summarization.
Zheng Zhao, Yftah Ziser, Bonnie Webber, Shay Cohen
EMNLP (Findings) 2023
This work presents an analysis tool based on joint matrix factorization for comparing latent representations of multilingual and monolingual models, and finds that the factorization outputs are strongly associated with performance observed across different cross-lingual tasks.
Zheng Zhao, Yftah Ziser, Shay Cohen
BlackboxNLP 2022
We examine how different domains are represented in neural network architectures, focusing on the relationship between domains, model size, and training data. Using subpopulation analysis with SVCCA on Transformer-based language models, we compare models trained on multiple domains versus a single domain. Our findings show that increasing model capacity affects domain information storage differently in upper and lower layers, with larger models embedding domain-specific information similarly to separate smaller models.
AACL-IJCNLP 2022
We developed a dual-way neural dictionary that retrieves words from definitions and generates definitions for words, learning both tasks simultaneously using shared embeddings. Our model performs well on benchmarks and is preferred by human evaluators, demonstrating its practical effectiveness and the benefits of multiple learning objectives.
Zheng Zhao#, Pinzhen Chen (# corresponding author)
Chinese National Conference on Computational Linguistics (CCL) 2022
This work presents multifaceted investigations of fine-tuning and adapters for summarization tasks of varying complexity, covering language, domain, and task transfer. The results provide insights into multilinguality, model convergence, and robustness, and aim to shed light on the pragmatic choice between fine-tuning and adapters in abstractive summarization.
International Workshop on Semantic Evaluation (SemEval) 2022 Honourable Mention for Best System Paper
This paper describes a winning entry for the SemEval 2022 Task 1 on reverse dictionary and definition modeling, using a unified model with multi-task training. The system performs consistently across languages, excelling with sgns embeddings, but reveals that definition generation quality remains challenging and BLEU scores may be misleading.
Workshop on Computational Approaches to Discourse (CODI) 2021
In PDTB-3, thousands of new implicit discourse relations were annotated within sentences, complicating the task of identifying their locations and senses compared to inter-sentential implicits. This paper analyzes model performance in this context, highlighting results, limitations, and future research directions.
Li Liang, Zheng Zhao, Bonnie Webber
Workshop on Computational Approaches to Discourse (CODI) 2020
This work presents data to support the claim that the PDTB-3 contains many more implicit discourse relations, and methods that can serve as a non-trivial baseline for future state-of-the-art recognizers for implicit discourse relations.
Zheng Zhao, Shay Cohen, Bonnie Webber
EMNLP (Findings) 2020
Abstractive summaries often hallucinate unsupported content. Our system, Herman, mitigates this by verifying specific entities such as dates and numbers, which improves summary accuracy and yields higher ROUGE scores and stronger human preference.