Understanding Text Documents
Enabling the analysis of large sets of documents on the Cray XMT
At a glance
Statistical Textual Document Analysis (STDA) refers to the process of deriving high-quality information from text. This information is derived by identifying patterns and trends through means such as statistical pattern learning. A typical application is to scan a set of documents and either model the document set for predictive classification purposes or populate a database or search index with the extracted information.
"Analyzing large textual document sets to extract topical information is a key technology that has many practical applications in national security, cybersecurity and science. Faster, scalable textual analytics enabled by the Cray XMT can help detect security threats, counter cyber attacks and help accelerate scientific discovery in areas like bioinformatics." - PNNL Task Lead Chad Scherrer
What we do
At Pacific Northwest National Laboratory's (PNNL) CASS-MT, the primary objective of the STDA task is to enable the analysis of large sets of documents.
How we do it
Currently, the team is focused on two methods that use the Cray XMT to quickly process data from raw documents into output for analysis tools such as search engines or visualization programs.
The first method focuses on the “bag of words” strategy. A “bag of words” is a representation of a document that includes the distinct words that appear in it, together with the number of times they appear. This technique offers a way to optimize the intermediate representation of documents and enables higher-level analysis on large document sets.
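As a rough illustration of the representation described above, the following Python sketch builds a bag of words for a single document. It is not the team's Cray XMT implementation; the function name `bag_of_words` and the simple lowercase tokenization rule are assumptions made for this example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each distinct word in `text` to the number of times it appears.

    Assumption for this sketch: a "word" is a run of lowercase letters,
    digits, or apostrophes after lowercasing the text.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


bag = bag_of_words("The cat sat on the mat")
print(bag["the"])  # 2: "The" and "the" are counted as the same word
print(len(bag))    # 5 distinct words
```

The word order is discarded; only the distinct words and their counts remain, which is what makes the representation compact enough to serve as an intermediate form for large document sets.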
Building this representation is a challenge on a cluster, where multiple processors are used at once: each processor has its own memory, but each is trying to refer to the same “index” for a distinct word. The Cray XMT offers a solution because it provides a shared memory for all processors, with lower costs for accessing shared data such as unique word indices.
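To make the idea of a shared word “index” concrete, the sketch below assigns each distinct word one integer index in a single vocabulary that every document refers to. This is a sequential, single-process illustration only: it does not model the Cray XMT's parallelism, and the names `tokenize` and `encode_documents` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (an assumed, simplistic rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def encode_documents(documents):
    """Return (vocabulary, encoded), where vocabulary maps word -> index
    and encoded holds one {word index: count} dict per document."""
    vocabulary = {}  # the single shared table every document refers to
    encoded = []
    for text in documents:
        counts = Counter()
        for word in tokenize(text):
            # A word gets the next free index the first time it is seen.
            word_id = vocabulary.setdefault(word, len(vocabulary))
            counts[word_id] += 1
        encoded.append(dict(counts))
    return vocabulary, encoded


docs = ["the cat sat", "the dog sat down"]
vocabulary, encoded = encode_documents(docs)
print(vocabulary)  # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}
print(encoded[1])  # {0: 1, 3: 1, 2: 1, 4: 1}
```

On a distributed-memory cluster, each processor would need to agree with the others on these indices; with one shared table, every lookup for a word resolves to the same entry.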
The next step involves computing a reverse mapping between words and documents. Essentially, this tells us, for every distinct word, which documents it appears in, which can be very useful for searching.
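The reverse mapping can be sketched in Python as a dictionary from each word to the set of documents that contain it. This is a minimal single-process example, not the Cray XMT code; the function name `build_reverse_mapping`, the document identifiers, and the sample texts are all made up for illustration.

```python
import re
from collections import defaultdict


def build_reverse_mapping(documents):
    """Map each distinct word to the set of document ids it appears in.

    `documents` is a dict of {document id: text}. Words are extracted
    with an assumed, simplistic lowercase tokenization rule.
    """
    mapping = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            mapping[word].add(doc_id)
    return mapping


documents = {
    "doc1": "Cray XMT shared memory",
    "doc2": "shared word index",
    "doc3": "Cray cluster memory",
}
mapping = build_reverse_mapping(documents)
print(sorted(mapping["cray"]))    # ['doc1', 'doc3']
print(sorted(mapping["shared"]))  # ['doc1', 'doc2']
```

A search for a word then becomes a single dictionary lookup instead of a scan over every document.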
- Testing document analysis with data from Wikipedia
- Applications in fast searching and indexing
- Information retrieval and management
- Searches in bioinformatics databases for gene or protein sequences
- Higher-level statistical analysis based on bag-of-words and reverse mapping
Chad Scherrer, Task Lead, PNNL
David Haglin, PNNL
Carina Lansing, PNNL
Tim Shippert, PNNL
Daniel Chavarria, PNNL