Today's NLP

[2022-08-15] Today's NLP

by 지환이아빠, 2022-08-15

The Moral Foundations Reddit Corpus

 

Moral framing and sentiment can affect a variety of online and offline behaviors, including donation, pro-environmental action, political engagement, and even participation in violent protests. Various computational methods in Natural Language Processing (NLP) have been used to detect moral sentiment from textual data, but achieving better performance on such subjective tasks requires large sets of hand-annotated training data. Previous corpora annotated for moral sentiment have proven valuable and have generated new insights both within NLP and across the social sciences, but they have been limited to Twitter. To improve our understanding of the role of moral rhetoric, we present the Moral Foundations Reddit Corpus, a collection of 16,123 Reddit comments curated from 12 distinct subreddits and hand-annotated by at least three trained annotators for 8 categories of moral sentiment (i.e., Care, Proportionality, Equality, Purity, Authority, Loyalty, Thin Morality, Implicit/Explicit Morality) based on the updated Moral Foundations Theory (MFT) framework. We use a range of methodologies, e.g., cross-domain classification and knowledge transfer, to provide baseline moral-sentiment classification results for this new corpus.
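Since each comment carries eight binary moral-sentiment labels, a natural starting point for such a corpus is multi-label text classification. Below is a minimal baseline sketch in scikit-learn, assuming the corpus is available as a CSV with a text column and one binary column per category; the file name, column names, and model choice are illustrative assumptions, not the paper's actual release format or baselines.

```python
# Minimal multi-label baseline sketch: TF-IDF features with one-vs-rest
# logistic regression. "mfrc.csv" and its column layout are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

LABELS = ["care", "proportionality", "equality", "purity",
          "authority", "loyalty", "thin_morality", "implicit_explicit"]

df = pd.read_csv("mfrc.csv")      # hypothetical path: one comment per row
texts = df["text"]                # raw Reddit comment text
labels = df[LABELS].values        # 8 binary indicator columns

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels,
                                          test_size=0.2, random_state=0)

vec = TfidfVectorizer(min_df=2, ngram_range=(1, 2))
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(vec.fit_transform(X_tr), y_tr)

pred = clf.predict(vec.transform(X_te))
print("macro-F1:", f1_score(y_te, pred, average="macro"))
```

A simple reference model like this is also a convenient starting point for the cross-domain experiments the abstract mentions, e.g., training on an existing Twitter corpus and evaluating on the Reddit comments.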

 


 

 

Overview of CTC 2021: Chinese Text Correction for Native Speakers

 

In this paper, we present an overview of CTC 2021, a Chinese text correction task for native speakers. We give detailed descriptions of the task definition and of the data used for training and evaluation, and we summarize the approaches investigated by the participants of this task. We hope the data sets collected and annotated for this task can facilitate and expedite future development in this research area. To that end, the pseudo training data, the gold-standard validation data, and the entire leaderboard are publicly available online at this https URL.

 

 

 

 

Language Tokens: A Frustratingly Simple Approach Improves Zero-Shot Performance of Multilingual Translation

 

This paper proposes a simple yet effective method to improve direct (X-to-Y) translation both in the zero-shot setting and when direct data is available. We modify the input tokens at both the encoder and the decoder to include signals for the source and target languages. We show a performance gain both when training from scratch and when finetuning a pretrained model with the proposed setup. In our experiments, the method gains nearly 10.0 BLEU points on in-house datasets, depending on the checkpoint selection criteria. In a WMT evaluation campaign, From-English performance improves by 4.17 BLEU points in the zero-shot setting and by 2.87 when direct data is available for training, while X-to-Y performance improves by 1.29 BLEU points over the zero-shot baseline and by 0.44 over the many-to-many baseline. In the low-resource setting, we see a 1.5-1.7 point improvement when finetuning on X-to-Y domain data.
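The core mechanism is easy to picture: prepend language tokens to the encoder input and seed the decoder with the target-language token, so the model receives an explicit source/target signal even for X-to-Y pairs never seen together during training. The sketch below illustrates the idea; the `__xx__` token format and the helper function are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative sketch of language tokens for multilingual translation.
# The "__xx__" token format is an assumed convention, not the paper's scheme.

def add_language_tokens(src_text: str, src_lang: str,
                        tgt_lang: str) -> tuple[str, str]:
    """Return (encoder_input, decoder_start) tagged with language signals."""
    # The encoder sees both the source- and target-language tokens.
    encoder_input = f"__{src_lang}__ __{tgt_lang}__ {src_text}"
    # Decoder generation starts from the target-language token.
    decoder_start = f"__{tgt_lang}__"
    return encoder_input, decoder_start

# Zero-shot direct pair (e.g., German -> French) at inference time:
enc, dec = add_language_tokens("Guten Morgen", src_lang="de", tgt_lang="fr")
print(enc)  # __de__ __fr__ Guten Morgen
print(dec)  # __fr__
```

Because the same tagging applies at training time, the scheme works both when training from scratch and when finetuning a pretrained model with direct X-to-Y data, matching the two settings the abstract reports gains for.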

 

 

 

 

