SCIE PDF Text Extractor · An optimized version of Apache PDFBox. It extracts the rough structure of a document (pages, blocks of text, and paragraphs, as well as formatting information) and is intended to improve text extraction results for scientific papers. The output can easily be transformed to plain text (toString) or to an XML format (toXML).
Group: de.cit-ec.scie - All Dependencies
SCIE PDF Text Extractor GUI · Provides a simple graphical user interface for the SCIE pdf-extractor module.
NER Webservice · Contains a debugging version of the SCIE Webservice that performs only ontology-based named entity recognition. This webservice can thus be used to list all the ontological named entities found in the input text.
SCIE Core · Contains the SCIE main application and the command-line interface (CLI). This project integrates the named entity recognition (NER), the PDF import, and the classification, and interfaces with the UIMA framework. The command-line interface can be used to produce a set of UIMA XCAS files.
SCIE Classifiers · Library based on liblinear that uses machine learning techniques to aggregate multiple UIMA annotations into compound UIMA annotations (higher-order concepts and relations).
NER Import · Tool used to import ontologies from various file formats (a simple native XML format used for the small ontologies, NCBI MeSH, NCBI Taxonomy) into the internal NER ontology database.
Webservice · Module providing the webservice interface, based on the Jetty embedded webserver and the FreeMarker template engine. It defines a simple format for supplying textual annotations and produces output in HTML or JSON. This module has no dependencies on the other SCIE modules (except for the PDF text extractor) or on the UIMA framework, and can therefore be used in any context where text is annotated by an algorithm and should be presented to an end user.
SCIE Webservice · Contains the SCIE Webservice. This application spawns multiple instances of the scie-core application in a process pool, relays requests from the web frontend to the analysis processes, and parses the resulting XCAS into interactive HTML output or JSON.
NER MapDB · Provides a binding between the NER subsystem and the MapDB database for storing large ontologies, capable of managing hundreds of thousands of individual surface forms and tens of thousands of ontology graph nodes.
SCIE Type System · This is an internally used library containing the UIMA type system descriptors and the annotator templates for the SCIE project.
NER Core · This module forms the main component of the ontology-based named entity recognition (NER). It can store arbitrary directed ontology graphs and supports multiple labels (ontological surface forms) per ontology graph node. It implements an easy-to-use class, NamedEntityRecognition, which can be used to (fuzzily) find ontology instances in a text. This module has no dependencies.
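The idea behind NER Core (a directed ontology graph whose nodes carry several surface forms, searched fuzzily in a text) can be illustrated with a minimal, dependency-free Java sketch. Everything in it is an assumption made for illustration: the class OntologyNerSketch, its methods, and the use of Levenshtein edit distance on single-word labels are hypothetical and do not reflect the actual API or matching strategy of the NamedEntityRecognition class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, NOT the NER Core API: a tiny directed ontology graph
// with several surface forms per node and a naive fuzzy token lookup.
final class OntologyNerSketch {

    /** One fuzzy hit: which node matched which token, and at what edit distance. */
    static final class Match {
        final String nodeId;
        final String token;
        final int distance;

        Match(String nodeId, String token, int distance) {
            this.nodeId = nodeId;
            this.token = token;
            this.distance = distance;
        }
    }

    // Node id -> lower-cased surface forms (labels) of that node.
    private final Map<String, List<String>> labels = new HashMap<>();
    // Node id -> ids of nodes reachable via one outgoing edge.
    private final Map<String, Set<String>> edges = new HashMap<>();

    void addNode(String id, String... surfaceForms) {
        List<String> forms = labels.computeIfAbsent(id, k -> new ArrayList<>());
        for (String form : surfaceForms) {
            forms.add(form.toLowerCase(Locale.ROOT));
        }
    }

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    Set<String> successors(String id) {
        return edges.getOrDefault(id, new HashSet<>());
    }

    /** Reports every token within maxDistance edits of some label of a node. */
    List<Match> find(String text, int maxDistance) {
        List<Match> matches = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token if the text starts with a delimiter
            }
            for (Map.Entry<String, List<String>> node : labels.entrySet()) {
                int best = Integer.MAX_VALUE;
                for (String label : node.getValue()) {
                    best = Math.min(best, levenshtein(token, label));
                }
                if (best <= maxDistance) {
                    matches.add(new Match(node.getKey(), token, best));
                }
            }
        }
        return matches;
    }

    /** Classic two-row dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        OntologyNerSketch ontology = new OntologyNerSketch();
        ontology.addNode("mouse", "mouse", "mice");
        ontology.addNode("rodent", "rodent");
        ontology.addEdge("mouse", "rodent"); // "mouse" points to its broader concept
        for (Match m : ontology.find("Two mouses and three mice", 1)) {
            System.out.println(m.token + " -> " + m.nodeId + " (distance " + m.distance + ")");
        }
    }
}
```

The sketch compares every token against every label, which is only workable for toy ontologies. Multi-word surface forms and efficient lookup over large label sets are deliberately left out.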