[2023-09-30] Today's Natural Language Processing

Human Feedback is not Gold Standard

Abstract: Human feedback has become the de facto standard for evaluating the performance of Large Language Models, and is increasingly being used as a training objective. However, it is not clear which properties of a generated output this single 'preference' score captures. We hypothesise that preference scores are subjective and open to undesirable biases. We critically analyse the use of human feedback for both training and evaluation, to verify whether it fully captures a range of crucial error criteria. We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality. We further hypothesise that both preference scores and error annotation may be affected by confounders, and leverage instruction-tuned models to generate outputs that vary along two possible confounding dimensions: assertiveness and complexity. We find that the assertiveness of an output skews the perceived rate of factuality errors, indicating that human annotations are not a fully reliable evaluation metric or training objective. Finally, we offer preliminary evidence that using human feedback as a training objective disproportionately increases the assertiveness of model outputs. We encourage future work to carefully consider whether preference scores are well aligned with the desired objective.

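Abstract (Korean translation, rendered in English): Human feedback has become the de facto standard for evaluating the performance of Large Language Models and is increasingly used as a training objective. However, it is unclear which properties of a generated output this single 'preference' score captures. We hypothesise that preference scores are subjective and open to undesirable biases. We critically analyse the use of human feedback for both training and evaluation to verify whether it fully captures a range of important error criteria. We find that preference scores have fairly good coverage but under-represent important aspects such as factuality. We also hypothesise that both preference scores and error annotations may be affected by confounders, and we use instruction-tuned models to generate outputs that vary along two possible confounding dimensions: assertiveness and complexity. We find that the assertiveness of an output skews the perceived rate of factuality errors, which indicates that human annotations are not a fully reliable evaluation metric or training objective. Finally, we offer preliminary evidence that using human feedback as a training objective disproportionately increases the assertiveness of model outputs. We encourage future work to consider carefully whether preference scores are well aligned with the desired objective.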

Stress Testing Chain-of-Thought Prompting for Large Language Models

Abstract: This report examines the effectiveness of Chain-of-Thought (CoT) prompting in improving the multi-step reasoning abilities of large language models (LLMs). Inspired by previous studies (Min et al., 2022), we analyze the impact of three types of CoT prompt perturbations, namely CoT order, CoT values, and CoT operators, on the performance of GPT-3 on various tasks. Our findings show that incorrect CoT prompting leads to poor performance on accuracy metrics. Correct values in the CoT are crucial for predicting correct answers. Moreover, incorrect demonstrations, where the CoT operators or the CoT order are wrong, do not affect the performance as drastically as the value-based perturbations. This research deepens our understanding of CoT prompting and opens some new questions regarding the capability of LLMs to learn reasoning in context.

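Abstract (Korean translation, rendered in English): This report examines how effective Chain-of-Thought (CoT) prompting is at improving the multi-step reasoning abilities of large language models (LLMs). Inspired by previous studies (Min et al., 2022), we analyze how three types of CoT prompt perturbations (CoT order, CoT values, and CoT operators) affect the performance of GPT-3 on various tasks. Our findings show that incorrect CoT prompts lead to degraded performance on accuracy metrics. Correct values in the CoT are important for predicting correct answers. In addition, incorrect demonstrations in which the CoT operators or the CoT order are wrong do not affect performance as strongly as value-based perturbations do. This research deepens our understanding of CoT prompting and raises some new questions about the ability of LLMs to learn reasoning in context.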
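To make the three perturbation types concrete, here is a minimal Python sketch. It shows one plausible reading of "CoT order", "CoT values", and "CoT operators" perturbations; the abstract does not describe the report's actual procedure, which may differ. The demonstration steps, the function names, and the prompt layout are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical arithmetic demonstration; not taken from the report.
COT_STEPS = ["5 + 3 = 8", "8 * 2 = 16"]

# Assumed operator mapping: each operator is replaced by a different one.
_OPERATOR_SWAP = {"+": "-", "-": "+", "*": "/", "/": "*"}


def perturb_order(steps):
    """Reverse the reasoning steps, leaving each step's content intact."""
    return list(reversed(steps))


def perturb_values(steps, offset=1):
    """Shift every number by `offset`, so the stated values become wrong."""
    return [
        re.sub(r"\d+", lambda m: str(int(m.group()) + offset), step)
        for step in steps
    ]


def perturb_operators(steps):
    """Swap each arithmetic operator for a different one, keeping the values."""
    return [
        re.sub(r"[+\-*/]", lambda m: _OPERATOR_SWAP[m.group()], step)
        for step in steps
    ]


def build_prompt(question, steps, answer):
    """Assemble a single few-shot demonstration in a simple Q/A layout."""
    chain = " ".join(f"{step}." for step in steps)
    return f"Q: {question}\nA: {chain} The answer is {answer}."


if __name__ == "__main__":
    for name, fn in [
        ("order", perturb_order),
        ("values", perturb_values),
        ("operators", perturb_operators),
    ]:
        print(name, "->", fn(COT_STEPS))
```

Each function changes exactly one aspect of the demonstration, so the effect of that aspect on model accuracy could be measured in isolation by comparing prompts built from the original and the perturbed steps.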

Social Media Fashion Knowledge Extraction as Captioning

Abstract: Social media plays a significant role in boosting the fashion industry, where a massive amount of fashion-related posts are generated every day. In order to obtain the rich fashion information from the posts, we study the task of social media fashion knowledge extraction. Fashion knowledge, which typically consists of the occasion, person attributes, and fashion item information, can be effectively represented as a set of tuples. Most previous studies on fashion knowledge extraction are based on fashion product images and do not consider the rich text information in social media posts. Existing work on fashion knowledge extraction in social media is classification-based and requires manually determining a set of fashion knowledge categories in advance. In our work, we propose to cast the task as a captioning problem to capture the interplay of the multimodal post information. Specifically, we transform the fashion knowledge tuples into a natural language caption with a sentence transformation method. Our framework then aims to generate the sentence-based fashion knowledge directly from the social media post. Inspired by the success of pre-trained models, we build our model on a multimodal pre-trained generative model and design several auxiliary tasks for enhancing the knowledge extraction. Since no existing dataset can be directly reused for our task, we introduce a dataset consisting of social media posts with manual fashion knowledge annotation. Extensive experiments are conducted to demonstrate the effectiveness of our model.

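Abstract (Korean translation, rendered in English): Social media plays a large role in boosting the fashion industry, where countless fashion-related posts are generated every day. To obtain the rich fashion information in these posts, we study the task of social media fashion knowledge extraction. Fashion knowledge typically consists of the occasion, person attributes, and fashion item information, and can be effectively represented as a set of tuples. Most previous studies on fashion knowledge extraction are based on fashion product images and do not consider the rich text information in social media posts. Existing work on fashion knowledge extraction in social media is classification-based and requires a set of fashion knowledge categories to be determined manually in advance. In our work, we propose casting the task as a captioning problem in order to capture the interplay of the multimodal post information. Specifically, we transform the fashion knowledge tuples into a natural language caption with a sentence transformation method. Our framework then aims to generate sentence-based fashion knowledge directly from the social media post. Inspired by the success of pre-trained models, we build our model on a multimodal pre-trained generative model and design several auxiliary tasks to enhance knowledge extraction. Because no existing dataset can be directly reused for the task, we introduce a dataset of social media posts with manual fashion knowledge annotation. Extensive experiments are conducted to demonstrate the effectiveness of our model.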
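The idea of turning fashion knowledge tuples into a caption, and reading tuples back out of a generated caption, can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not specify the sentence transformation method, so the template, the `FashionTuple` type, and both helper functions are hypothetical.

```python
import re
from typing import List, NamedTuple


class FashionTuple(NamedTuple):
    """Hypothetical tuple layout: occasion, person attributes, fashion item."""
    occasion: str
    person: str
    item: str


# Assumed sentence template; the paper's actual transformation is not given.
TEMPLATE = "A {person} wears {item} for {occasion}."

# Regex that inverts TEMPLATE. It breaks if a field itself contains
# " wears " or " for ", which is acceptable for a toy example.
PATTERN = re.compile(
    r"A (?P<person>.+?) wears (?P<item>.+?) for (?P<occasion>.+?)\."
)


def tuples_to_caption(tuples: List[FashionTuple]) -> str:
    """Render each tuple as one sentence and join them into a caption."""
    return " ".join(TEMPLATE.format(**t._asdict()) for t in tuples)


def caption_to_tuples(caption: str) -> List[FashionTuple]:
    """Recover tuples from a caption that follows TEMPLATE."""
    return [
        FashionTuple(m["occasion"], m["person"], m["item"])
        for m in PATTERN.finditer(caption)
    ]


if __name__ == "__main__":
    knowledge = [
        FashionTuple("a wedding", "young woman", "a white dress"),
        FashionTuple("a conference", "young man", "a navy suit"),
    ]
    caption = tuples_to_caption(knowledge)
    print(caption)
    print(caption_to_tuples(caption) == knowledge)
```

With a reversible template like this, a captioning model can be trained on the generated sentences, and its outputs can be parsed back into tuples for evaluation without fixing a set of categories in advance.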
