August 2, Thursday
10:00 – 11:00
a) Medical text uses many technical terms to refer to anatomy, biology or diseases. Most medical NLP tools rely on the UMLS, a medical vocabulary with over 300K unique terms and more than 1M synonyms. We present a method for automatically creating a Hebrew-UMLS lexicon. We show that acquiring this resource reduces error on the NLP tasks of segmentation and part-of-speech (POS) tagging. We examine the impact of this improvement on a classification task: identifying patients with epilepsy from the notes of the Children Neurology Unit in Soroka, where F1 improves from 92% to 96%.
b) EHR text is characterized by a high level of copy-paste redundancy in the medical notes of the same patient. We show quantitatively that this type of redundant word distribution is highly prevalent in both Hebrew and English notes, and demonstrate empirically that this characteristic of medical note collections has a deleterious effect on classical NLP algorithms. We present a novel algorithm for topic modeling with Latent Dirichlet Allocation (LDA) that is immune to the redundancy noise. This algorithm also performs better than the baseline on clusters of redundant news reports.
c) Syntactic dependency parsing is a useful technique for Information Extraction and is widely used in the biomedical domain. However, syntactic parsers suffer a major decline in accuracy when applied to a domain different from that of their training data. We present a method that uses Selectional Preferences (the affinity between the words in a pair or triplet), modeled with LDA, to improve dependency parsing using un-annotated data from the target domain, and obtain a significant improvement.
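As an illustration of the copy-paste redundancy discussed in item b), the following is a minimal sketch of one way such redundancy could be quantified. It is not the measure used in this work, which the abstract does not specify. The function names, the choice of word trigrams, and the example notes are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def redundancy(previous_notes: List[str], new_note: str, n: int = 3) -> float:
    """Fraction of the new note's n-grams already present in earlier notes.

    Hypothetical measure: 0.0 means nothing was repeated, 1.0 means every
    n-gram of the new note also appears in an earlier note of the patient.
    """
    new_grams = word_ngrams(new_note, n)
    if not new_grams:
        return 0.0  # note too short to contain any n-gram
    seen: Set[Tuple[str, ...]] = set()
    for note in previous_notes:
        seen |= word_ngrams(note, n)
    return len(new_grams & seen) / len(new_grams)


if __name__ == "__main__":
    # Invented example notes for a single patient (not real data).
    earlier = ["patient has a history of focal seizures since age four"]
    follow_up = ("patient has a history of focal seizures since age four "
                 "no new events")
    print(f"redundancy: {redundancy(earlier, follow_up):.2f}")
```

A high score for a follow-up note indicates that most of its text was carried over from earlier notes, which is the kind of repeated word distribution that can distort corpus-level statistics.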
Taken together, these techniques provide an infrastructure that allows practical processing of medical text in Hebrew. We make available a first set of language resources for Hebrew medical text processing (treebank, lexicon, part-of-speech tagger, syntactic parser, topic modeling toolkit). This infrastructure has been applied to practical text mining of hospital patient reports.