[2023-11-30] Today's Natural Language Processing

CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models

Abstract: As the scaling of Large Language Models (LLMs) has dramatically enhanced their capabilities, there has been a growing focus on the alignment problem to ensure their responsible and ethical use. While existing alignment efforts predominantly concentrate on universal values such as the HHH principle, the aspect of culture, which is inherently pluralistic and diverse, has not received adequate attention. This work introduces a new benchmark, CDEval, aimed at evaluating the cultural dimensions of LLMs. CDEval is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains. Our comprehensive experiments provide intriguing insights into the culture of mainstream LLMs, highlighting both consistencies and variations across different dimensions and domains. The findings underscore the importance of integrating cultural considerations in LLM development, particularly for applications in diverse cultural settings. Through CDEval, we aim to broaden the horizon of LLM alignment research by including cultural dimensions, thus providing a more holistic framework for the future development and evaluation of LLMs. This benchmark serves as a valuable resource for cultural studies in LLMs, paving the way for more culturally aware and sensitive models.

The Falcon Series of Open Language Models

Abstract: We introduce the Falcon series: 7B, 40B, and 180B parameter causal decoder-only models trained on diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text, the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, allowing us to efficiently pretrain these models on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect. We release a 600B-token extract of our web dataset, as well as the Falcon-7/40/180B models under a permissive license to foster open science and accelerate the development of an open ecosystem of large language models.

Reducing Gender Bias in Machine Translation through Counterfactual Data Generation

Abstract: Recent advances in neural methods have led to substantial improvement in the quality of Neural Machine Translation (NMT) systems. However, these systems frequently produce translations with inaccurate gender (Stanovsky et al., 2019), which can be traced to bias in training data. Saunders and Byrne (2020) tackle this problem with a handcrafted dataset containing balanced gendered profession words. By using this data to fine-tune an existing NMT model, they show that gender bias can be significantly mitigated, albeit at the expense of translation quality due to catastrophic forgetting. They recover some of the lost quality with modified training objectives or additional models at inference. We find, however, that simply supplementing the handcrafted dataset with a random sample from the base model training corpus is enough to significantly reduce the catastrophic forgetting. We also propose a novel domain-adaptation technique that leverages in-domain data created with the counterfactual data generation techniques proposed by Zmigrod et al. (2019) to further improve accuracy on the WinoMT challenge test set without significant loss in translation quality. We show its effectiveness in NMT systems from English into three morphologically rich languages: French, Spanish, and Italian. The relevant dataset and code will be available on GitHub.
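
The data-mixing idea in this abstract (supplementing the handcrafted set with a random sample from the base training corpus before fine-tuning) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `build_finetuning_mix`, the toy sentence pairs, and the sample size of 5 are all hypothetical, and the abstract does not say what sampling ratio the paper uses.

```python
import random


def build_finetuning_mix(handcrafted, base_corpus, base_sample_size, seed=0):
    """Return the handcrafted pairs plus a random sample of base-corpus pairs.

    Hypothetical helper: the sample size is an assumption, since the
    abstract does not specify how much base data is mixed in.
    """
    rng = random.Random(seed)
    # Never request more pairs than the base corpus contains.
    k = min(base_sample_size, len(base_corpus))
    sampled = rng.sample(base_corpus, k)
    mixed = list(handcrafted) + sampled
    # Shuffle so gender-balanced and general-domain pairs are interleaved.
    rng.shuffle(mixed)
    return mixed


# Toy (source, target) pairs standing in for a gender-balanced handcrafted set.
handcrafted = [
    ("The doctor finished her work.", "La doctora terminó su trabajo."),
    ("The nurse finished his work.", "El enfermero terminó su trabajo."),
]

# Placeholder pairs standing in for the base model's training corpus.
base_corpus = [(f"source sentence {i}", f"target sentence {i}") for i in range(100)]

mixed = build_finetuning_mix(handcrafted, base_corpus, base_sample_size=5)
print(len(mixed))
```

The resulting list would then be passed to whatever fine-tuning routine the NMT toolkit provides. The intuition is that the sampled general-domain pairs keep the model exposed to its original training distribution, which is what limits catastrophic forgetting.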
