[2023-10-30] Today's Natural Language Processing

Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models

With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. The capability to combine skills plays an important role in (human) pedagogy and also in a paper on emergence phenomena (Arora & Goyal, 2023). This work introduces Skill-Mix, a new evaluation to measure the ability to combine skills. Using a list of $N$ skills, the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model. Administering a version of Skill-Mix to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards ("cramming for the leaderboard"). Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training. We sketch how the methodology can lead to a Skill-Mix based ecosystem of open evaluations for AI capabilities of future models.
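The sampling step described in the abstract can be sketched roughly as follows. This is only an illustration, not the paper's implementation: the skill names, the prompt wording, and the function names are all hypothetical. The exact number of $k$-subsets is the binomial coefficient $\binom{N}{k}$, which for fixed $k$ grows on the order of $N^k$.

```python
import math
import random


def count_skill_subsets(n_skills, k):
    """Number of distinct k-subsets of n_skills skills: C(N, k), roughly N^k / k! for k << N."""
    return math.comb(n_skills, k)


def sample_skill_subset(skills, k, rng):
    """Pick k distinct skills uniformly at random."""
    return rng.sample(skills, k)


def build_prompt(subset):
    """Hypothetical prompt wording; the paper's actual prompt is not given in the abstract."""
    return "Write a short piece of text that demonstrates all of these skills: " + ", ".join(subset)


# Hypothetical skill list, for illustration only.
skills = ["metaphor", "modus ponens", "red herring", "spatial reasoning",
          "self-serving bias", "folk physics"]

rng = random.Random(0)  # seeded so repeated runs pick the same subset
subset = sample_skill_subset(skills, 3, rng)
prompt = build_prompt(subset)

print(count_skill_subsets(100, 5))  # 75287520
print(prompt)
```

Even with a hypothetical list of 100 skills, $k=5$ already yields about 75 million possible subsets, which is the intuition behind the claim that most sampled combinations are unlikely to appear in the training data.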


Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana

This paper reports on a set of three recent experiments utilizing large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana. While ORF is a well-established measure of foundational literacy, assessing it typically requires one-on-one sessions between a student and a trained evaluator, a process that is time-consuming and costly. Automating the evaluation of ORF could support better literacy instruction, particularly in education contexts where formative assessment is uncommon due to large class sizes and limited resources. To our knowledge, this research is among the first to examine the use of the most recent versions of large-scale speech models (Whisper V2, wav2vec2.0) for ORF assessment in the Global South. We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate of 13.5. This is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago. We also find that when these transcriptions are used to produce fully automated ORF scores, they closely align with scores generated by expert human graders, with a correlation coefficient of 0.96. Importantly, these results were achieved on a representative dataset (i.e., students with regional accents, recordings taken in actual classrooms), using a free and publicly available speech model out of the box (i.e., no fine-tuning). This suggests that using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
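For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference text and a transcription, divided by the number of reference words. The sketch below shows that standard definition only; the text normalization (lowercasing, whitespace splitting) and the example sentences are assumptions, and the abstract does not describe the paper's exact scoring pipeline.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words.

    Normalization here (lowercase, split on whitespace) is an assumption.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of an extra word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up passage and transcription, for illustration only.
passage = "the cat sat on the mat"
transcript = "the cat sit on the mat"
print(word_error_rate(passage, transcript))
```

A reported WER of 13.5 corresponds to a return value of about 0.135 here, assuming the figure in the abstract is a percentage.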


Towards Matching Phones and Speech Representations

Learning phone types from phone instances is a long-standing problem that remains open. In this work, we revisit this problem in the context of self-supervised learning, and pose it as the problem of matching cluster centroids to phone embeddings. We study two key properties that enable matching, namely, whether cluster centroids of self-supervised representations reduce the variability of phone instances and respect the relationship among phones. We then use the matching result to produce pseudo-labels and introduce a new loss function for improving self-supervised representations. Our experiments show that the matching result captures the relationship among phones. Training with the new loss function jointly with the regular self-supervised losses, such as APC and CPC, significantly improves downstream phone classification.
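To make the idea of "matching cluster centroids to phone embeddings" concrete, here is a toy sketch. The abstract does not say which matching algorithm or embedding space the authors use, so everything below is an assumption: centroids and phone embeddings are taken to live in the same 2-D space, the match is a one-to-one assignment minimizing total squared distance (found by brute force, which only works for tiny inputs), and the phone labels and vectors are made up.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_centroids_to_phones(centroids, phone_embeddings):
    """Return {centroid_index: phone_label} minimizing total squared distance.

    Brute force over all one-to-one assignments; illustrative only.
    """
    labels = list(phone_embeddings)
    if len(labels) != len(centroids):
        raise ValueError("this sketch assumes as many centroids as phones")
    best_cost, best_perm = None, None
    for perm in permutations(labels):
        cost = sum(sq_dist(c, phone_embeddings[p]) for c, p in zip(centroids, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return dict(enumerate(best_perm))


def pseudo_label(frame, centroids, matching):
    """Label a frame with the phone matched to its nearest centroid."""
    nearest = min(range(len(centroids)), key=lambda i: sq_dist(frame, centroids[i]))
    return matching[nearest]


# Made-up centroids and phone embeddings, for illustration only.
centroids = [(0.9, 0.1), (0.1, 0.8), (0.5, 0.5)]
phone_embeddings = {"aa": (0.0, 1.0), "iy": (1.0, 0.0), "uw": (0.5, 0.6)}

matching = match_centroids_to_phones(centroids, phone_embeddings)
print(matching)
print(pseudo_label((0.2, 0.9), centroids, matching))
```

Pseudo-labels produced this way could then feed an auxiliary classification loss alongside the usual self-supervised objective, which is the general shape of what the abstract describes; the specific loss is not given there.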

