[2023-12-08] Today's Natural Language Processing

Teaching Specific Scientific Knowledge into Large Language Models through Additional Training

Abstract: Through additional training, we explore embedding specialized scientific knowledge into the Llama 2 Large Language Model (LLM). Key findings reveal that effective knowledge integration requires reading texts from multiple perspectives, especially in instructional formats. We utilize text augmentation to tackle the scarcity of specialized texts, including style conversions and translations. Hyperparameter optimization proves crucial, with different size models (7B, 13B, and 70B) reasonably undergoing additional training. Validating our methods, we construct a dataset of 65,000 scientific papers. Although we have succeeded in partially embedding knowledge, the study highlights the complexities and limitations of incorporating specialized information into LLMs, suggesting areas for further improvement.

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

Abstract: Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks. However, training these models can incur significant expenses, often requiring tens of thousands of GPUs for months of continuous operation. Typically, this training is carried out in specialized GPU clusters equipped with homogeneous high-speed Remote Direct Memory Access (RDMA) network interface cards (NICs). The acquisition and maintenance of such dedicated clusters is challenging. Current LLM training frameworks, like Megatron-LM and Megatron-DeepSpeed, focus primarily on optimizing training within homogeneous cluster settings. In this paper, we introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over the heterogeneous NIC environment. Our primary technical contribution lies in a novel scheduling method that intelligently allocates distinct computational tasklets in LLM training to specific groups of GPU devices based on the characteristics of their connected NICs. Furthermore, our proposed framework, utilizing pipeline parallel techniques, demonstrates scalability to multiple GPU clusters, even in scenarios without high-speed interconnects between nodes in distinct clusters. We conducted comprehensive experiments that involved various scenarios in the heterogeneous NIC environment. In most cases, our framework achieves performance levels close to those achievable with homogeneous RDMA-capable networks (InfiniBand or RoCE), significantly exceeding training efficiency within the pure Ethernet environment. Additionally, we verified that our framework outperforms other mainstream LLM frameworks under heterogeneous NIC environment in terms of training efficiency and can be seamlessly integrated with them.
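
To make the NIC-aware grouping idea a bit more concrete, here is a minimal toy sketch in Python. It is not Holmes's scheduler: the `Gpu` record, `group_by_nic`, and `assign_pipeline_stages` are hypothetical names made up for illustration, and the policy (one pipeline stage per group of GPUs that share a cluster and a NIC type) is only an assumption about how such a scheme could look. The intuition is that bandwidth-hungry collective traffic stays inside a homogeneous group, while only the lighter stage-to-stage traffic of pipeline parallelism crosses the slower links between groups.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical description of one GPU device (illustration only)."""
    rank: int
    cluster: str
    nic: str  # e.g. "infiniband", "roce", "ethernet"


def group_by_nic(gpus: list[Gpu]) -> dict[tuple[str, str], list[int]]:
    """Group GPU ranks so that every group shares one cluster and one NIC type."""
    groups: dict[tuple[str, str], list[int]] = defaultdict(list)
    for gpu in gpus:
        groups[(gpu.cluster, gpu.nic)].append(gpu.rank)
    return dict(groups)


def assign_pipeline_stages(
    groups: dict[tuple[str, str], list[int]],
) -> dict[int, list[int]]:
    """Give each homogeneous group one pipeline stage, in sorted key order.

    Assumption for this sketch: one stage per group. A real scheduler would
    also balance layers and memory across stages.
    """
    return {stage: ranks for stage, (_, ranks) in enumerate(sorted(groups.items()))}


gpus = [
    Gpu(0, "cluster_a", "infiniband"),
    Gpu(1, "cluster_a", "infiniband"),
    Gpu(2, "cluster_b", "roce"),
    Gpu(3, "cluster_b", "roce"),
]
groups = group_by_nic(gpus)
stages = assign_pipeline_stages(groups)
```

In this toy setup, ranks 0 and 1 end up in stage 0 and ranks 2 and 3 in stage 1, so the only traffic between the InfiniBand and RoCE groups is the activation hand-off between the two stages.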

DBCopilot: Scaling Natural Language Querying to Massive Databases

Abstract: Text-to-SQL simplifies database interactions by enabling non-experts to convert their natural language (NL) questions into Structured Query Language (SQL) queries. While recent advances in large language models (LLMs) have improved the zero-shot text-to-SQL paradigm, existing methods face scalability challenges when dealing with massive, dynamically changing databases. This paper introduces DBCopilot, a framework that addresses these challenges by employing a compact and flexible copilot model for routing across massive databases. Specifically, DBCopilot decouples the text-to-SQL process into schema routing and SQL generation, leveraging a lightweight sequence-to-sequence neural network-based router to formulate database connections and navigate natural language questions through databases and tables. The routed schemas and questions are then fed into LLMs for efficient SQL generation. Furthermore, DBCopilot also introduced a reverse schema-to-question generation paradigm, which can learn and adapt the router over massive databases automatically without requiring manual intervention. Experimental results demonstrate that DBCopilot is a scalable and effective solution for real-world text-to-SQL tasks, providing a significant advancement in handling large-scale schemas.
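
The two-step split described in the abstract (route the question to a schema first, then hand only that schema to an LLM) can be sketched roughly as follows. This is a toy illustration under loud assumptions: the real router is a sequence-to-sequence neural network, whereas `route_schema` here is a plain word-overlap ranker standing in for it, and the `Table` record, `build_sql_prompt`, and the prompt layout are all hypothetical names and choices, not DBCopilot's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Hypothetical catalog entry: one table inside one database."""
    database: str
    name: str
    columns: tuple[str, ...]


def route_schema(question: str, catalog: list[Table], top_k: int = 1) -> list[Table]:
    """Toy stand-in for the router: rank tables by word overlap with the question.

    Assumption: keyword overlap replaces the learned seq2seq router
    purely for illustration.
    """
    words = set(question.lower().replace("?", " ").split())

    def score(table: Table) -> int:
        vocab = {table.database, table.name, *table.columns}
        return len(words & {v.lower() for v in vocab})

    # sorted() is stable, so ties keep their catalog order.
    return sorted(catalog, key=score, reverse=True)[:top_k]


def build_sql_prompt(question: str, tables: list[Table]) -> str:
    """Build the text an LLM would receive: only the routed schema plus the question."""
    schema_lines = [f"{t.database}.{t.name}({', '.join(t.columns)})" for t in tables]
    return "Schema:\n" + "\n".join(schema_lines) + f"\nQuestion: {question}\nSQL:"


catalog = [
    Table("hr", "employees", ("id", "name", "salary")),
    Table("shop", "orders", ("id", "customer", "total")),
]
question = "What is the total of all orders?"
routed = route_schema(question, catalog)
prompt = build_sql_prompt(question, routed)
```

The point of the split is that the prompt stays small no matter how many databases exist: the LLM only ever sees the handful of tables the router picked, here `shop.orders`.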
