[2023-10-09] Today's Natural Language Processing

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Abstract: Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But, what are the safety costs associated with such custom fine-tuning? We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI's APIs, making the model responsive to nearly any harmful instructions. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing -- even if a model's initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning. We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.

A Long Way to Go: Investigating Length Correlations in RLHF

Abstract: Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering, summarization, and multi-turn dialogue. When optimizing for helpfulness, RLHF has been consistently observed to drive models to produce longer outputs. This paper demonstrates that optimizing for response length is a significant factor behind RLHF's reported improvements in these settings. First, we study the relationship between reward and length for reward models trained on three open-source preference datasets for helpfulness. Here, length correlates strongly with reward, and improvements in reward score are driven in large part by shifting the distribution over output lengths. We then explore interventions during both RL and reward model learning to see if we can achieve the same downstream improvements as RLHF without increasing length. While our interventions mitigate length increases, they aren't uniformly effective across settings. Furthermore, we find that even running RLHF with a reward based solely on length can reproduce most of the downstream improvements over the initial policy model, showing that reward models in these settings have a long way to go.
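
To make the length-versus-reward idea concrete, here is a minimal sketch of how one might measure that correlation and what a length-only reward could look like. It is not the paper's code: the sample responses and scores are made up, whitespace splitting stands in for a real tokenizer, and `length_only_reward` with its `target_len` cap is a hypothetical illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_only_reward(response, target_len=50):
    """Hypothetical reward that ignores content and depends only on length.

    Assumption: length is counted in whitespace-separated tokens and the
    reward saturates at 1.0 once target_len tokens are reached.
    """
    n_tokens = len(response.split())
    return min(n_tokens, target_len) / target_len


if __name__ == "__main__":
    # Made-up (response, reward model score) pairs, for illustration only.
    samples = [
        ("Yes.", 0.10),
        ("Yes, that should work.", 0.35),
        ("Yes, that should work, and here is why it does.", 0.60),
        ("Yes, that should work, and here is a longer explanation of why.", 0.70),
    ]
    lengths = [len(text.split()) for text, _ in samples]
    scores = [score for _, score in samples]
    print("length/reward correlation:", pearson(lengths, scores))
    print("length-only reward:", [length_only_reward(t) for t, _ in samples])
```

A correlation near 1 on real reward model outputs would point to the length bias the abstract describes. The capped length reward is only one simple way a content-blind baseline could be defined.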

Redefining Digital Health Interfaces with Large Language Models

Abstract: Digital health tools have the potential to significantly improve the delivery of healthcare services. However, their use remains comparatively limited due, in part, to challenges surrounding usability and trust. Recently, Large Language Models (LLMs) have emerged as general-purpose models with the ability to process complex information and produce human-quality text, presenting a wealth of potential applications in healthcare. Directly applying LLMs in clinical settings is not straightforward, with LLMs susceptible to providing inconsistent or nonsensical answers. We demonstrate how LLMs can utilize external tools to provide a novel interface between clinicians and digital technologies. This enhances the utility and practical impact of digital healthcare tools and AI models while addressing current issues with using LLM in clinical settings such as hallucinations. We illustrate our approach with examples from cardiovascular disease and diabetes risk prediction, highlighting the benefit compared to traditional interfaces for digital tools.
