[2023-04-27] Today's Natural Language Processing

Semantic Tokenizer for Enhanced Natural Language Processing

Traditionally, NLP performance improvement has focused on improving models and increasing the number of model parameters. NLP vocabulary construction has remained focused on maximizing the number of words represented through subword regularization. We present a novel tokenizer that uses semantics to drive vocabulary construction. The tokenizer includes a trainer that uses stemming to enhance subword formation. Further optimizations and adaptations are implemented to minimize the number of words that cannot be encoded. The encoder is updated to integrate with the trainer. The tokenizer is implemented as a drop-in replacement for the SentencePiece tokenizer. The new tokenizer more than doubles the number of wordforms represented in the vocabulary. The enhanced vocabulary significantly improves NLP model convergence and improves the quality of word and sentence embeddings. Our experimental results show top performance on two GLUE tasks using BERT-base, improving on models more than 50X in size.
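To make the stemming idea concrete, here is a minimal toy sketch in Python. It is not the paper's trainer and does not use the SentencePiece API; the suffix list, the `##` suffix marker, and the function names `split_stem`, `build_vocab`, and `encodable` are all assumptions made for illustration. It only shows why storing stems and suffixes separately lets a small vocabulary cover more wordforms than storing whole words.

```python
from collections import Counter

# Hypothetical suffix list; a real stemmer (e.g. Porter) is far more careful.
SUFFIXES = ("ing", "ed", "es", "s")

def split_stem(word):
    """Split a word into (stem, suffix) using naive suffix stripping."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)], suffix
    return word, ""

def build_vocab(word_counts, size):
    """Count stems and '##suffix' pieces, then keep the `size` most frequent."""
    pieces = Counter()
    for word, count in word_counts.items():
        stem, suffix = split_stem(word)
        pieces[stem] += count
        if suffix:
            pieces["##" + suffix] += count
    return {piece for piece, _ in pieces.most_common(size)}

def encodable(word, vocab):
    """A word is encodable if its stem and its suffix piece are both in the vocabulary."""
    stem, suffix = split_stem(word)
    return stem in vocab and (not suffix or "##" + suffix in vocab)

# Toy corpus counts (made up for this example).
counts = {"walk": 5, "walked": 3, "walking": 4, "talks": 3,
          "talked": 2, "jumping": 1, "jumped": 1}
vocab = build_vocab(counts, size=6)

seen = [w for w in counts if encodable(w, vocab)]
unseen = [w for w in ("talking", "jumps", "running") if encodable(w, vocab)]
print(len(vocab), "pieces cover", len(seen), "observed wordforms")
print("unseen wordforms that are still encodable:", unseen)
```

In this toy setting six pieces cover all seven observed wordforms and also encode combinations such as "talking" that never appeared in the counts, whereas a six-entry whole-word vocabulary could cover at most six wordforms.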

A Novel Dual of Shannon Information and Weighting Scheme

Shannon information theory has achieved great success not only in communication technology, for which it was originally developed, but also in many other science and engineering fields such as machine learning and artificial intelligence. Inspired by the famous weighting scheme TF-IDF, we discovered that information entropy has a natural dual. We complement the classical Shannon information theory by proposing a novel quantity, namely troenpy. Troenpy measures the certainty, commonness and similarity of the underlying distribution. To demonstrate its usefulness, we propose a troenpy-based weighting scheme for documents with class labels, namely positive class frequency (PCF). On a collection of public datasets we show that the PCF-based weighting scheme outperforms the classical TF-IDF and a popular Optimal Transport-based word moving distance algorithm in a kNN setting. We further developed a new odds-ratio type feature, namely Expected Class Information Bias (ECIB), which can be regarded as the expected odds ratio of the information quantities entropy and troenpy. In the experiments we observe that including the new ECIB features and simple binary term features in a simple logistic regression model significantly improves performance further. The simple new weighting scheme and ECIB features are very effective and can be computed with linear complexity.
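For reference, the two classical quantities the abstract builds on, Shannon entropy and TF-IDF, can be sketched in a few lines of Python. The abstract does not state the formulas for troenpy, PCF, or ECIB, so none of them is reproduced here; the function names, the toy documents, and the particular TF-IDF variant (relative term frequency times the natural log of N over document frequency) are assumptions chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def tf_idf(docs):
    """Per-document TF-IDF weights: (count / doc length) * ln(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights

# Toy tokenized documents (made up for this example).
docs = [["cat", "sat"], ["cat", "ran"]]
weights = tf_idf(docs)

print(shannon_entropy([0.5, 0.5]))  # a fair coin carries 1 bit of uncertainty
print(weights[0])                   # "cat" appears in every document, so its weight is 0
```

A term that occurs in every document gets zero IDF weight, which is the behaviour of the classical scheme that the PCF weighting is compared against.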

Nondeterministic Stacks in Neural Networks

Human language is full of compositional syntactic structures, and although neural networks have contributed to groundbreaking improvements in computer systems that process language, widely used neural network architectures still exhibit limitations in their ability to process syntax. To address this issue, prior work has proposed adding stack data structures to neural networks, drawing inspiration from theoretical connections between syntax and stacks. However, these methods employ deterministic stacks that are designed to track one parse at a time, whereas syntactic ambiguity, which requires a nondeterministic stack to parse, is extremely common in language. In this dissertation, we remedy this discrepancy by proposing a method of incorporating nondeterministic stacks into neural networks. We develop a differentiable data structure that efficiently simulates a nondeterministic pushdown automaton, representing an exponential number of computations with a dynamic programming algorithm. We incorporate this module into two predominant architectures: recurrent neural networks (RNNs) and transformers. We show that this raises their formal recognition power to arbitrary context-free languages, and also aids training, even on deterministic context-free languages. Empirically, neural networks with nondeterministic stacks learn context-free languages much more effectively than prior stack-augmented models, including a language with theoretically maximal parsing difficulty. We also show that an RNN augmented with a nondeterministic stack is capable of surprisingly powerful behavior, such as learning cross-serial dependencies, a well-known non-context-free pattern. We demonstrate improvements on natural language modeling and provide analysis on a syntactic generalization benchmark. This work represents an important step toward building systems that learn to use syntax in a more human-like fashion.
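To illustrate what a nondeterministic stack buys, the following Python sketch simulates a nondeterministic pushdown automaton for the even-length palindromes over {0, 1}, a standard example of a context-free language that no deterministic pushdown automaton recognizes. This is not the differentiable, dynamic-programming data structure from the dissertation: it simply enumerates the set of reachable (state, stack) configurations, and the function name and state labels are assumptions made for illustration.

```python
def accepts_even_palindrome(s):
    """Simulate a nondeterministic PDA for { w + reverse(w) : w over {0, 1} }."""
    # A configuration is (state, stack). In "push" the machine is still reading
    # the first half; in "pop" it is matching the second half against the stack.
    configs = {("push", ())}
    for symbol in s:
        next_configs = set()
        for state, stack in configs:
            if state == "push":
                # Branch 1: assume we are still in the first half and push.
                next_configs.add(("push", stack + (symbol,)))
            # Branch 2 (the only move in "pop"): assume the midpoint has been
            # passed, so the symbol must match the top of the stack.
            if stack and stack[-1] == symbol:
                next_configs.add(("pop", stack[:-1]))
        configs = next_configs
    # Accept if any surviving branch has emptied its stack.
    return any(stack == () for _, stack in configs)

for text in ("0110", "00", "", "0101", "0"):
    print(repr(text), accepts_even_palindrome(text))
```

A deterministic stack would have to commit to a single guess about where the midpoint is; tracking every guess in parallel is the behaviour that the paper's differentiable module is said to simulate efficiently.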
