[2023-10-16] Today's Natural Language Processing

LLM-augmented Preference Learning from Natural Language

Abstract: Finding preferences expressed in natural language is an important but challenging task. State-of-the-art (SotA) methods leverage transformer-based models such as BERT and RoBERTa, and graph neural architectures such as graph attention networks. Since Large Language Models (LLMs) are equipped to deal with larger context lengths and have much larger model sizes than those transformer-based models, we investigate their ability to classify comparative text directly. This work aims to serve as a first step towards using LLMs for the Comparative Preference Classification (CPC) task. We design and conduct a set of experiments that format the classification task into an input prompt for the LLM, together with a methodology for obtaining a fixed-format response that can be evaluated automatically. Comparing performance with existing methods, we see that pre-trained LLMs are able to outperform the previous SotA models with no fine-tuning involved. Our results show that LLMs consistently outperform the SotA when the target text is large (i.e., composed of multiple sentences) and remain comparable to the SotA on shorter text. We also find that few-shot learning yields better performance than zero-shot learning.

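Abstract (translated from Korean): Finding preferences expressed in natural language is important but challenging. State-of-the-art (SotA) methods use transformer-based models such as BERT and RoBERTa along with graph neural networks such as graph attention networks. Because Large Language Models (LLMs) handle longer contexts and are far larger than those transformer-based models, we investigate their ability to classify comparative text directly. This work aims to be a first step toward using LLMs for the CPC task. We design and run a series of experiments that cast the classification task as an input prompt for the LLM, with a methodology for obtaining a fixed-format response that can be evaluated automatically. Compared with existing methods, pre-trained LLMs can outperform the previous SotA models without fine-tuning. Our results show that LLMs consistently outperform the SotA when the target text is large (that is, when it consists of multiple sentences) and remain comparable to SotA performance on short text. We also find that few-shot learning performs better than zero-shot learning.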
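The abstract describes two moving parts: turning a classification instance into a prompt, and getting back a fixed-format answer that can be scored automatically. The sketch below shows one way this could look. The prompt wording, the label set (`BETTER`, `WORSE`, `NONE`), and the function names `build_prompt` and `parse_label` are all assumptions made for illustration; they are not taken from the paper, and no actual LLM call is made.

```python
import re

# Assumed label set for illustration; the abstract does not list the labels.
LABELS = ("BETTER", "WORSE", "NONE")

# Accept "BETTER" or "Label: BETTER" at the start of the response, any casing.
_LABEL_PATTERN = re.compile(
    r"^\s*(?:label\s*:\s*)?(" + "|".join(LABELS) + r")\b", re.IGNORECASE
)


def build_prompt(text, first, second, examples=()):
    """Format one instance as a prompt; `examples` adds few-shot demonstrations."""
    lines = [
        "Does the text say the first item is better or worse than the second?",
        "Answer with exactly one word: " + ", ".join(LABELS) + ".",
        "",
    ]
    for ex_text, ex_first, ex_second, ex_label in examples:
        lines += [
            f"Text: {ex_text}",
            f"First: {ex_first}",
            f"Second: {ex_second}",
            f"Label: {ex_label}",
            "",
        ]
    lines += [f"Text: {text}", f"First: {first}", f"Second: {second}", "Label:"]
    return "\n".join(lines)


def parse_label(response):
    """Return the label in a fixed-format response, or None if it does not conform."""
    match = _LABEL_PATTERN.match(response)
    return match.group(1).upper() if match else None


def accuracy(predictions, gold):
    """Fraction of predictions equal to the gold labels (non-conforming answers count as wrong)."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


zero_shot = build_prompt("Tea is far nicer than coffee.", "tea", "coffee")
few_shot = build_prompt(
    "Tea is far nicer than coffee.",
    "tea",
    "coffee",
    examples=[("Cats are worse pets than dogs.", "cats", "dogs", "WORSE")],
)
```

The zero-shot and few-shot prompts differ only in the demonstrations placed before the final instance, which is the comparison the abstract reports on. Rejecting anything that does not match the expected format keeps the evaluation automatic rather than dependent on manual reading of free-form answers.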

MProto: Multi-Prototype Network with Denoised Optimal Transport for Distantly Supervised Named Entity Recognition

Abstract: Distantly supervised named entity recognition (DS-NER) aims to locate entity mentions and classify their types using only knowledge bases or gazetteers and an unlabeled corpus. However, distant annotations are noisy and degrade the performance of NER models. In this paper, we propose a noise-robust prototype network named MProto for the DS-NER task. Unlike previous prototype-based NER methods, MProto represents each entity type with multiple prototypes to characterize the intra-class variance among entity representations. To optimize the classifier, each token should be assigned an appropriate ground-truth prototype, and we treat this token-prototype assignment as an optimal transport (OT) problem. Furthermore, to mitigate the noise from incomplete labeling, we propose a novel denoised optimal transport (DOT) algorithm. Specifically, we use the assignment result between Other-class tokens and all prototypes to distinguish unlabeled entity tokens from true negatives. Experiments on several DS-NER benchmarks demonstrate that MProto achieves state-of-the-art performance. The source code is now available on GitHub.

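Abstract (translated from Korean): Distantly supervised named entity recognition (DS-NER) aims to find entity mentions and classify their types using only knowledge bases or gazetteers and an unlabeled corpus. However, distant annotations are noisy and degrade the performance of NER models. This paper proposes MProto, a noise-robust prototype network for the DS-NER task. Unlike earlier prototype-based NER methods, MProto represents each entity type with multiple prototypes to capture the intra-class variance among entity representations. To optimize the classifier, each token must be assigned an appropriate ground-truth prototype, and this token-prototype assignment is treated as an optimal transport (OT) problem. To mitigate the noise caused by incomplete labeling, we also propose a new denoised optimal transport (DOT) algorithm. Specifically, we use the assignment result between Other-class tokens and all prototypes to distinguish unlabeled entity tokens from true negatives. Experiments on several DS-NER benchmarks show that MProto achieves state-of-the-art performance. The source code is now available on GitHub.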
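To make the idea of treating token-prototype assignment as an optimal transport problem more concrete, here is a minimal sketch using entropy-regularised Sinkhorn iterations. This is a generic stand-in, not the paper's DOT algorithm: the cost values, the uniform marginals, the regularisation strength `reg`, and the function names `sinkhorn` and `assign` are all assumptions chosen for illustration.

```python
import math


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularised OT plan between uniform marginals.

    Rows of `cost` are tokens, columns are prototypes (assumed layout).
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass per token (assumption)
    b = [1.0 / m] * m  # uniform mass per prototype (assumption)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def assign(plan):
    """Pick, for each token, the prototype that receives most of its mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


# Made-up costs: 4 tokens of one entity type, 2 prototypes for that type.
cost = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.9, 0.1],
    [0.8, 0.3],
]
plan = sinkhorn(cost)
prototype_ids = assign(plan)
```

Compared with assigning each token to its nearest prototype independently, the column marginal forces the prototypes to share the tokens, which is one reason an OT formulation suits a multi-prototype model. The denoising step described in the abstract, which inspects how Other-class tokens are assigned across all prototypes, is not reproduced here.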

A Biomedical Knowledge Graph for Biomarker Discovery in Cancer

Abstract: Structured and unstructured data and facts about drugs, genes, proteins, viruses, and their mechanisms are spread across a huge number of scientific articles. These articles are a large-scale knowledge source and can have a huge impact on disseminating knowledge about the mechanisms of certain biological processes. A domain-specific knowledge graph (KG) is an explicit conceptualization of a specific subject-matter domain, represented with respect to semantically interrelated entities and relations. A KG can be constructed by integrating such facts and data and can be used for data integration, exploration, and federated queries. However, exploring and querying large-scale KGs is tedious for certain groups of users due to a lack of knowledge about the underlying data assets or semantic technologies. Such a KG not only allows deducing new knowledge and question answering (QA) but also allows domain experts to explore it. Since cross-disciplinary explanations are important for accurate diagnosis, it is important to query the KG to provide interactive explanations about learned biomarkers. Inspired by these observations, we construct a domain-specific KG, particularly for cancer-specific biomarker discovery. The KG is constructed by integrating cancer-related knowledge and facts from multiple sources. First, we construct a domain-specific ontology, which we call the OncoNet Ontology (ONO). The ONO is developed to enable semantic reasoning for verifying predictions about relations between diseases and genes. The KG is then developed and enriched by harmonizing the ONO, additional metadata schemas, ontologies, controlled vocabularies, and additional concepts from external sources using a BERT-based information extraction method. BioBERT and SciBERT are fine-tuned on selected articles crawled from PubMed. We list some queries and some examples of QA and knowledge deduction based on the KG.

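Abstract (translated from Korean): Structured and unstructured data and facts about drugs, genes, proteins, viruses, and their mechanisms are scattered across a vast number of scientific papers. These articles form a large-scale knowledge source and can greatly influence how knowledge about the mechanisms of particular biological processes is disseminated. A domain-specific knowledge graph (KG) is an explicit conceptualization of a particular subject-matter domain, expressed in terms of semantically interrelated entities and relations. A KG can be built by integrating such facts and data, and it can be used for data integration, exploration, and federated queries. However, exploring and querying large-scale KGs is tedious for certain groups of users because they lack knowledge of the underlying data assets or semantic technologies. Such a KG makes it possible to deduce new knowledge and answer questions (QA), and it also lets domain experts explore the data. Because cross-disciplinary explanations matter for accurate diagnosis, it is important to query the KG to provide interactive explanations of learned biomarkers. Motivated by this, we construct a domain-specific KG aimed at cancer-specific biomarker discovery. The KG is built by integrating cancer-related knowledge and facts from multiple sources. We first build a domain-specific ontology, called the OncoNet Ontology (ONO). The ONO was developed to enable semantic reasoning for verifying predictions about relations between diseases and genes. The KG is then developed and enriched by harmonizing the ONO, additional metadata schemas, ontologies, controlled vocabularies, and additional concepts from external sources, using a BERT-based information extraction method. BioBERT and SciBERT are fine-tuned on selected articles crawled from PubMed. We list several queries and several examples of QA and knowledge deduction based on the KG.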
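As a rough illustration of what "querying the KG" and "deducing new knowledge" can mean in practice, the sketch below stores facts as (subject, predicate, object) triples, answers a pattern query, and applies one inference rule. Everything here is hypothetical: the `TripleStore` class, the predicates `biomarker_for`, `targets`, and `possibly_relevant_to`, the rule itself, and the placeholder entities such as `GENE_A` are invented for the example and do not come from the paper or the ONO ontology.

```python
class TripleStore:
    """A tiny in-memory set of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self._triples
            if all(want is None or want == got for want, got in zip(pattern, triple))
        )


def infer_relevance(store):
    """Hypothetical rule: if a drug targets a gene and that gene is a biomarker
    for a disease, derive (drug, 'possibly_relevant_to', disease)."""
    inferred = set()
    for drug, _, gene in store.match(predicate="targets"):
        for _, _, disease in store.match(subject=gene, predicate="biomarker_for"):
            inferred.add((drug, "possibly_relevant_to", disease))
    return inferred


# Placeholder entities only; these are not real biomedical facts.
kg = TripleStore()
kg.add("GENE_A", "biomarker_for", "CANCER_X")
kg.add("GENE_B", "biomarker_for", "CANCER_X")
kg.add("DRUG_1", "targets", "GENE_A")

# Query: which genes are recorded as biomarkers for CANCER_X?
biomarkers = [s for s, _, _ in kg.match(predicate="biomarker_for", obj="CANCER_X")]

# Deduction: derive triples that were never stated explicitly.
new_facts = infer_relevance(kg)
```

A production KG would more likely sit in an RDF store and be queried with SPARQL, with an ontology reasoner handling inference; the in-memory version only shows the shape of the two operations.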

