[2023-07-06] Today's Natural Language Processing

Estimating Post-OCR Denoising Complexity on Numerical Texts

Post-OCR processing has significantly improved over the past few years. However, these improvements have primarily benefited texts consisting of natural, alphabetical words, as opposed to documents of a numerical nature such as invoices, payslips, medical certificates, etc. To evaluate the OCR post-processing difficulty of these datasets, we propose a method to estimate the denoising complexity of a text, evaluate it on several datasets of varying nature, and show that texts of a numerical nature are at a significant disadvantage. We evaluate the estimated complexity ranking with respect to the error rates of modern-day denoising approaches to show the validity of our estimator.
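The abstract does not say which error rate is used to compare denoising approaches. As an illustration only, the sketch below computes character error rate (CER), a metric commonly used for OCR output, from the Levenshtein edit distance. The function names `levenshtein` and `cer` and the sample invoice strings are assumptions made for this example, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical invoice line: three digits misread as look-alike letters.
reference = "Total: 1,250.00"
ocr_output = "Total: 1,2S0.OO"
print(cer(reference, ocr_output))
```

A misread digit cannot be recovered from a dictionary the way a misspelled word can, which is one intuition for why numerical texts would be harder to denoise.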

VOLTA: Diverse and Controllable Question-Answer Pair Generation with Variational Mutual Information Maximizing Autoencoder

Previous question-answer pair generation methods aimed to produce fluent and meaningful question-answer pairs but tended to have poor diversity. Recent attempts to address this issue suffer from either low model capacity or overcomplicated architectures. Furthermore, they overlook the problem that the controllability of their models is highly dependent on the input. In this paper, we propose a model named VOLTA that enhances generative diversity by leveraging the Variational Autoencoder framework with a shared backbone network as its encoder and decoder. In addition, we propose adding InfoGAN-style latent codes to enable input-independent controllability over the generation process. We perform comprehensive experiments and the results show that our approach can significantly improve diversity and controllability over state-of-the-art models.
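The abstract does not give VOLTA's exact training objective. For orientation only, the two standard ingredients it names can be written in their textbook forms: the Variational Autoencoder evidence lower bound, and the InfoGAN variational lower bound on the mutual information between latent codes and generated output. The symbols (encoder q_phi, decoder p_theta, generator G, latent code c, auxiliary distribution Q) follow the usual conventions of those frameworks and are assumptions here, not notation from the paper.

```latex
% Standard VAE objective (evidence lower bound), maximized over \theta, \phi
\mathcal{L}_{\mathrm{ELBO}}(x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)

% InfoGAN-style variational lower bound on mutual information
I\big(c;\, G(z, c)\big)
  \;\ge\; \mathbb{E}_{c \sim p(c),\; x \sim G(z, c)}\big[\log Q(c \mid x)\big] + H(c)
```

Maximizing the second bound encourages the generated output to remain predictable from the code c, which is the usual mechanism by which such codes act as controls that do not depend on the input.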

Data-Driven Information Extraction and Enrichment of Molecular Profiling Data for Cancer Cell Lines

With the proliferation of research means and computational methodologies, published biomedical literature is growing exponentially in numbers and volume. As a consequence, in the fields of biological, medical and clinical research, domain experts have to sift through massive amounts of scientific text to find relevant information. However, this process is extremely tedious and slow when performed by humans. Hence, novel computational information extraction and correlation mechanisms are required to boost meaningful knowledge extraction. In this work, we present the design, implementation and application of a novel data extraction and exploration system. This system extracts deep semantic relations between textual entities from scientific literature to enrich existing structured clinical data in the domain of cancer cell lines. We introduce a new public data exploration portal, which enables automatic linking of genomic copy number variant plots with ranked, related entities such as affected genes. Each relation is accompanied by literature-derived evidence, allowing for deep, yet rapid, literature search, using existing structured data as a springboard. Our system is publicly available on the web at this https URL
