July 2005 
Volume 04, Issue 3 
Tech Talk

Text mining, for golden results

Boeing Tech Fellows have produced reams of research reports over the years. Imagine being tasked with organizing that information manually, tabulating it and finding correlations between concepts contained within it. It sounds like a daunting task. But with text mining tools developed by Boeing mathematicians and technologists, it'd be a snap.

Text mining is a family of technologies for analyzing large amounts of text. It lets users feed massive amounts of seemingly disorganized data into a computer, run it through mathematical analysis and gain insight into the data's underlying organization.

There are a couple of schools of thought on analyzing text by computer, said Rod Tjoelker, a Mathematics and Computing Technology manager in Phantom Works. One is called natural language processing, which looks at grammar and sentence structure. Natural language processing essentially requires a computer to "read" and understand how words relate to other words. Each language must be modeled separately.

The other school of thought is statistical text mining. Anne Kao, a member of Tjoelker's group, leads a research team that invented a method of casting words and documents into a mathematical framework to capture the correlations between words and identify the concepts being described in the documents. This approach uses those correlations to relate similar ideas expressed with different words, rather than relying on a dictionary, as natural language processing does. As a result, different languages are not a barrier to analyzing content.
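
To make the statistical idea concrete, here is a minimal Python sketch. It is not the TRUST method, whose details the article does not give; it is a generic co-occurrence approach that stands in for it. The sample documents, the stopword list and the function names are all invented for illustration. Each word is represented by counts of the other words it shares documents with, and two words are compared by the cosine of the angle between those count vectors.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical sample documents, invented for this sketch.
DOCS = [
    "the aircraft wing needs inspection",
    "the airplane wing needs repair",
    "inspection of the aircraft engine",
    "repair of the airplane engine",
    "employee survey comments on benefits",
    "survey comments on employee parking",
]
STOPWORDS = {"the", "of", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def cooccurrence_vectors(docs):
    """Map each word to counts of the words it shares a document with."""
    vectors = defaultdict(Counter)
    for doc in docs:
        terms = set(tokenize(doc))
        for a, b in permutations(terms, 2):
            vectors[a][b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


VECTORS = cooccurrence_vectors(DOCS)


def similarity(a, b):
    """Similarity of two words, judged only by the company they keep."""
    return cosine(VECTORS[a], VECTORS[b])


print(similarity("aircraft", "airplane"))  # positive: shared context words
print(similarity("aircraft", "survey"))    # zero: no shared context words
```

In this toy corpus "aircraft" and "airplane" never appear in the same document, yet they score as related because both appear alongside "wing", "needs" and "engine". No dictionary tells the program the two words are synonyms, so the same code would work on text in any language that can be split into words.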

"By converting text to a mathematical format, the algorithms we developed to analyze the data will quickly notice that you are talking about similar things even though you are using different terms," Tjoelker said.

The basic underlying technology developed by Kao and her team is called "TRUST"—Text Representation Using Subspace Transformation—and it has several applications. TRUST can be used for knowledge management, classification or categorization of large volumes of data, and establishing consistency among data. This tool is particularly effective when trying to organize large amounts of information, replacing cumbersome manual processes.

Another way TRUST can be used is to assist in retrieving information. Because TRUST learns which terms are strongly associated with each other, it can find documents related to a keyword even if the keyword isn't in the document.
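
The retrieval idea can be sketched the same way. The Python example below is an illustration only, not TRUST itself: it assumes a simple scheme in which a keyword is expanded with the terms it co-occurs with, and documents are scored against the expanded set. The documents and the names `associations` and `search` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample documents, invented for this sketch.
DOCS = [
    "airplane wing repair",
    "airplane engine repair",
    "aircraft wing inspection",
    "aircraft engine inspection",
    "employee survey comments",
]


def terms(text):
    """Return the set of distinct lowercase words in a document."""
    return set(text.lower().split())


def associations(docs):
    """Count how often each pair of distinct terms shares a document."""
    assoc = defaultdict(Counter)
    for doc in docs:
        doc_terms = terms(doc)
        for a in doc_terms:
            for b in doc_terms:
                if a != b:
                    assoc[a][b] += 1
    return assoc


def search(keyword, docs):
    """Rank documents by the keyword plus the terms associated with it."""
    related = associations(docs).get(keyword, {})
    top = max(related.values(), default=1)
    # The keyword itself gets weight 1; associated terms get a weight
    # proportional to how often they co-occur with the keyword.
    weights = Counter({keyword: 1.0})
    for term, count in related.items():
        weights[term] = count / top
    scored = [(sum(weights[t] for t in terms(doc)), doc) for doc in docs]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0]


for doc in search("airplane", DOCS):
    print(doc)
```

A search for "airplane" lists the two "airplane" documents first, followed by the two "aircraft" documents, which never mention the keyword but share the associated terms "wing" and "engine". The unrelated survey document is left out.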

Text mining has many benefits. The biggest is the time saved by not having to organize data manually, or mentally. Another is ready access to information, because the computer serves as a central repository for that data.

The technology also ensures concepts are represented consistently, helps users find synergies they might not otherwise have noticed, and organizes the data for users to search easily.

Boeing has used text mining in a number of ways in recent years. Tech Fellows use a text classifier to organize a database of technology experts in the company, categorizing information from different sources. Shared Services is using it to create a Technology Bookshelf, a global resource of technology solutions. The written comments in the Boeing Employee Survey have also been analyzed using TRUST to determine the major topics employees are commenting on.

In addition to internal use, Boeing has licensed TRUST to other organizations, including Battelle's Pacific Northwest National Laboratory. Battelle has a product called Starlight, a 3-D interactive visualization tool for analyzing and combining text and geographical information.

