MineBioText - A Biomedical Literature Mining Tool

Biomedical Informatics (BMI) is the emerging area that aims to bring together Bioinformatics and Medical Informatics. The mission of BMI is to provide the technical and scientific infrastructure and knowledge to allow evidence-based, individualised healthcare using all relevant sources of information. These sources include the "classical" information as currently maintained in the health record, as well as new genomic, proteomic and other molecular-level information. Aiming at a change from late-stage diagnosis towards early detection or even prediction of disease, BMI has the potential to foster discovery and creation of novel diagnostic and therapeutic methods, in order to improve the health and quality of life of the individual, as well as the efficiency of expenditure in healthcare systems.

Biomedical Text Categorization (BTC) aims at better retrieval of relevant text references and improves the potential for knowledge discovery (e.g., gene/protein correlations) from the retrieved texts. In general, we may refer to two main methods of text categorization:\n# Unsupervised method – the task here is to induce clusters of texts with high intra-cluster (within cluster) similarity and high inter-cluster (between clusters) dissimilarity.\n# Supervised method – the task here is to devise a feature-based prediction model (procedure, metric or both) in order to predict the category to which a text belongs (a training phase is required, based on collections of pre-categorized text references).
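The supervised method described above can be illustrated with a small sketch. It assumes a multinomial Naive Bayes model with Laplace smoothing, which is only one possible prediction model and is not stated to be the one used here; the function names and the toy training texts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(labelled_texts):
    """Training phase: count term frequencies per category from (text, category) pairs."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labelled_texts:
        term_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    return term_counts, doc_counts


def predict(text, term_counts, doc_counts):
    """Return the category with the highest smoothed log-probability."""
    vocabulary = {t for counts in term_counts.values() for t in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category, counts in term_counts.items():
        total_terms = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for token in tokenize(text):
            # Laplace smoothing keeps unseen terms from zeroing the score.
            score += math.log((counts[token] + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy training set; real use would rely on pre-categorized abstracts.
TRAINING = [
    ("BRCA1 mutation carriers show elevated breast tumour risk", "breast cancer"),
    ("HER2 overexpression in breast carcinoma", "breast cancer"),
    ("insulin resistance and glucose metabolism", "diabetes"),
    ("pancreatic beta cell insulin secretion", "diabetes"),
]

term_counts, doc_counts = train(TRAINING)
prediction = predict("BRCA1 and HER2 in breast tumour samples", term_counts, doc_counts)
print(prediction)
```

The training phase only counts terms per category, so adding a new category amounts to adding pre-categorized texts for it.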

''Literature Mining'' in the biomedical domain aims to identify and extract valid, novel, potentially useful and ultimately understandable nuggets of information and patterns in scientific literature. It combines two technologies: Information Extraction (IE) and [[Text Mining]] (TM). IE identifies predefined classes of entities, relations and events that are explicitly mentioned in the literature. TM identifies non-trivial, implicit, previously unknown and useful patterns that are not explicitly mentioned in the text.\n[img[Literature Data Mining|LiteratureDataMining.jpg]]


Information Retrieval is concerned with identifying the documents that are most relevant to a user’s need within a very large set of documents. More precisely, given a large database of documents and a specific information need, usually expressed as a query by the user, the goal of information retrieval methods is to find the documents in the database that satisfy the information need. Naturally, the task has to be performed accurately and efficiently.\n
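As an illustration of matching a query against a document collection, here is a minimal sketch that ranks documents by TF-IDF cosine similarity. TF-IDF is an assumption chosen for the example, not a method named in the text above, and the sample documents and function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (a dict) per document, plus the IDF weights."""
    tokenized = [Counter(tokenize(d)) for d in documents]
    n_docs = len(documents)
    doc_freq = Counter(term for counts in tokenized for term in counts)
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in counts.items()} for counts in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, documents):
    """Return indices of documents with non-zero similarity, best match first."""
    vectors, idf = tfidf_vectors(documents)
    query_counts = Counter(tokenize(query))
    # Query terms absent from the collection get weight 0 and cannot match.
    query_vec = {t: tf * idf.get(t, 0.0) for t, tf in query_counts.items()}
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    ordered = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ordered if score > 0]


# Hypothetical document collection.
DOCUMENTS = [
    "BRCA1 mutations increase breast cancer risk",
    "Insulin signalling in type 2 diabetes",
    "Breast cancer screening with mammography",
]

ranking = rank("breast cancer BRCA1", DOCUMENTS)
print(ranking)
```

A real system would precompute the document vectors once and use an inverted index, so that efficiency does not depend on scanning every document per query.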

Literature data mining is the process of identifying and extracting valid, novel and useful nuggets of information and patterns from scientific literature. It comprises two technologies: text mining and information extraction. Literature data mining has progressed from simple recognition of terms to extraction of interaction relationships from complex sentences.

MineBioText \n''[[Short Guide|http://www.ics.forth.gr/~potamias/MineBioText/MineBioText_Short_Guide.pdf]]'' \n[[Literature Mining]]

~MineBioText is a generic software tool for Data Mining in Biomedical Text Documents. ~MineBioText aims at:\n# Identification of gene/protein--gene/protein and gene/protein--disease associations following a Text Mining approach. The approach utilizes data-mining and statistical techniques, algorithms and metrics to tackle the problems of:\n## Identification and recognition of terms in text-references – based on an appropriately devised and implemented algorithmic process.\n## Ranking of terms and their (potential) relations or links – based on the MIM entropic metric (Mutual Information Metric) to measure the respective terms’ association strength.\n# Construction of a gene Association Network based on the assessed association strengths of the terms (i.e., genes, proteins, diseases).\n# Categorization / Classification of text-references (mainly from the [[PubMed|http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed]] abstracts repository) into class categories utilizing an appropriately devised classification metric and procedure, and using the most descriptive (i.e., strong) associations between terms. 
Pre-assignment of text-references (i.e., [[PubMed|http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed]] abstracts) to categories is performed by posting respective queries to [[PubMed|http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed]]; e.g., when [[PubMed|http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed]] is queried with “breast cancer”, the retrieved documents are considered to belong to the “breast cancer” category.\n# Assessment of the texts’ categorization / classification results – based on respective ~PubMed abstract collections, their pre-categorization and a careful experimental set-up to measure prediction results, i.e., accuracy and precision.\n# Design and development of a Graphical User Interface (GUI) that encompasses all of the aforementioned operations, with extra functionalities for setting up the domain of reference and study, e.g., gene/protein and disease names, their synonyms and free-text descriptions, text collections, parameterization of built-in algorithmic processes etc. \n[img[MineBioText|MineBioText.jpg]]
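The ranking of term associations by a mutual-information metric can be sketched as follows. The exact formula of the MIM metric is not given above, so this example assumes document-level pointwise mutual information, PMI(a, b) = log2(P(a, b) / (P(a) P(b))), as a stand-in; the function name, the threshold and the sample data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def association_strengths(doc_terms):
    """Score each co-occurring term pair by pointwise mutual information.

    doc_terms: list of sets, one set of recognized terms per text-reference.
    """
    n_docs = len(doc_terms)
    term_freq = Counter(t for terms in doc_terms for t in terms)
    pair_freq = Counter(
        pair for terms in doc_terms for pair in combinations(sorted(terms), 2)
    )
    scores = {}
    for (a, b), joint in pair_freq.items():
        # P(a, b) = joint / n_docs, P(a) = term_freq[a] / n_docs, and so on.
        scores[(a, b)] = math.log2(joint * n_docs / (term_freq[a] * term_freq[b]))
    return scores


# Hypothetical recognized terms for four text-references.
DOC_TERMS = [
    {"BRCA1", "breast cancer"},
    {"BRCA1", "breast cancer", "TP53"},
    {"TP53", "lung cancer"},
    {"insulin", "diabetes"},
]

scores = association_strengths(DOC_TERMS)

# Keep only positively associated pairs as edges of an association network.
network = {pair: s for pair, s in scores.items() if s > 0}
for pair, strength in sorted(network.items(), key=lambda item: -item[1]):
    print(pair, round(strength, 2))
```

Pairs that co-occur no more often than chance score 0 and are dropped from the network. The threshold of 0 is an arbitrary choice for this sketch.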


Text Mining (TM) refers to the emerging research area that can be roughly characterized as knowledge discovery from large text collections, thus combining knowledge discovery and text processing methods. It uses techniques from the general field of data mining, but since it handles unstructured data, a major part of the process deals with the crucial stage of pre-processing the document collections: term extraction and information extraction.\nIt is concerned mainly with the discovery of interesting patterns such as clusters, associations, deviations, similarities, and differences between terms, between documents, and between terms and documents.\n\nIt is also defined as the process of discovering and extracting knowledge from unstructured data, in contrast to data mining, which discovers knowledge from structured data. Instead of leaving the user with the problem of having to read several tens of thousands of documents, text mining makes it possible to extract precise facts and to find interesting associations among disparate facts, leading to the discovery of new or unsuspected and hidden knowledge in text references. Normally TM comprises three steps:\n# In the first step, relevant text-references are collected, mainly based on [[Information Retrieval]] approaches.\n# In the next step, known as Information Extraction, identification and extraction of the information pieces (mainly terms or small phrases) from the (retrieved) texts is performed – this is done in accordance with the user’s requests; in principle it is based on [[Information Retrieval]] techniques, and mainly on text parsing operations.\n# In the last step, mainly data-mining, machine learning and statistical techniques are employed in order to induce and identify associations among the pieces of the extracted information.\n[img[Text Mining|TextMining.jpg]]\nA potential view of Text Mining components and operations
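The second step above (extraction of terms from the retrieved texts) can be sketched with a simple dictionary lookup. This is an assumed approach for illustration only: the synonym table, the function names and the longest-match-first strategy are all hypothetical and are not taken from the text.

```python
import re

# Hypothetical synonym dictionary: canonical name -> known surface forms.
SYNONYMS = {
    "BRCA1": ["BRCA1", "breast cancer 1", "RNF53"],
    "TP53": ["TP53", "p53"],
    "breast cancer": ["breast cancer", "breast carcinoma"],
}


def build_matcher(synonyms):
    """Compile one case-insensitive pattern per surface form, longest form first."""
    forms = [(form, canonical) for canonical, fs in synonyms.items() for form in fs]
    forms.sort(key=lambda fc: len(fc[0]), reverse=True)
    return [
        (re.compile(r"\b" + re.escape(form) + r"\b", re.IGNORECASE), canonical)
        for form, canonical in forms
    ]


def extract_terms(text, matcher):
    """Return the set of canonical terms mentioned in the text."""
    found = set()
    for pattern, canonical in matcher:
        if pattern.search(text):
            found.add(canonical)
            # Blank out the matched spans so that a shorter form nested inside
            # a longer one (e.g. "breast cancer" in "breast cancer 1") is not
            # counted a second time.
            text = pattern.sub(lambda m: " " * len(m.group()), text)
    return found


matcher = build_matcher(SYNONYMS)
terms = extract_terms("Mutant p53 and RNF53 were studied in breast carcinoma.", matcher)
print(sorted(terms))
```

The sets returned for each text-reference are the kind of input that the last step would consume when inducing associations among the extracted terms.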