Text analysis is the processing and representation of data in text form for the purpose of analyzing it and learning new models from it. The main challenge in text analysis is high dimensionality: when analyzing a document, every distinct word that can occur in the document represents a dimension. The other major challenge is that the data is unstructured.
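To make the dimensionality point concrete, here is a minimal sketch of a bag-of-words representation, in which each distinct word in a corpus becomes one dimension of a document vector. The two sample sentences and the helper names (tokenize, to_vector) are assumptions made for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical two-document corpus, used only for illustration.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

# The vocabulary is the set of distinct words across the corpus;
# each word becomes one dimension of the document vectors.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]

print(len(vocabulary), "dimensions:", vocabulary)
for vector in vectors:
    print(vector)
```

Even these two short sentences need seven dimensions. A real corpus can easily reach tens of thousands, with most entries in each vector being zero.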
Text Analysis Process:
In a typical environment, text analysis is carried out in the following steps:
- Parsing
- Search / Retrieval
- Text Mining
Parsing is the step that takes an unstructured or semi-structured document and imposes a structure on it for downstream analysis. Parsing essentially means reading the text, which could be a weblog, an RSS feed, an XML or HTML file, or a Word document.
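As an illustration of the parsing step, the following sketch reads an HTML string with Python's standard html.parser module, keeps only the visible text, and splits it into tokens. The class name TextExtractor, the function parse_html and the sample document are assumptions made for this example. Other formats such as RSS or Word documents would need their own readers.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def parse_html(html):
    """Impose a simple structure on raw HTML: plain text plus a token list."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {"text": text, "tokens": tokens}

# Hypothetical input document.
sample = (
    "<html><head><title>Daily Report</title>"
    "<style>p {color: red}</style></head>"
    "<body><p>Sales rose 5% in March.</p></body></html>"
)
record = parse_html(sample)
print(record["text"])
print(record["tokens"])
```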
Once parsing is done, the problem shifts to searching for and retrieving specific words or phrases, or finding a specific topic or entity (a person or a corporation) in a document or a corpus (body of knowledge). All text representation takes place implicitly in the context of the corpus. Search and retrieval is the kind of task we are used to performing with search engines such as Google.
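One common way to support this kind of search is an inverted index, which maps each token to the documents that contain it. The sketch below is a minimal version under assumed inputs: the three-document corpus and the function names build_index and search are hypothetical, and real search engines add ranking, stemming and much more.

```python
import re
from collections import defaultdict

# Hypothetical corpus: document id -> text.
corpus = {
    "doc1": "Acme Corp reported strong quarterly earnings.",
    "doc2": "The weather was strong and windy all week.",
    "doc3": "Analysts expect Acme earnings to grow.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of documents containing every word of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)

index = build_index(corpus)
print(search(index, "Acme earnings"))
print(search(index, "strong"))
```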
With the completion of these two steps, the output is a structured set of tokens or keywords that have been searched, retrieved and organized. The third task is mining the text, that is, understanding the content itself. Instead of treating the text as a set of tokens or keywords, in this step we derive meaningful insights from the data that pertain to the domain of knowledge, the business process or the problem we are trying to solve.
Statistical Analytical Tools and Techniques in Text Analysis:
Many techniques, such as clustering and classification, can be adapted to perform text mining once the text has a proper representation. K-means clustering or other methods can be used to group the text into meaningful subjects. Sentiment analysis and spam filtering are examples of classification tasks in text mining. In addition to traditional statistical methods, natural language processing methods are also used in this phase.
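As a sketch of how classification can be applied to text, the example below implements a small multinomial naive Bayes classifier for spam filtering. Naive Bayes is chosen here for illustration and is not prescribed by the text above. The class name, the training messages and the labels "spam" and "ham" are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data.
train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("claim your free prize"))
print(model.predict("project meeting on monday"))
```

The same token counts could instead feed a clustering method such as k-means. The representation step is shared and only the learning algorithm changes.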
Applications of Text Analysis:
A common application is sentiment analysis of unstructured data from social media, which includes tracking the following:
- Mentions of the brand
- Mentions of the product
- Positive / Negative Sentiments
- Product comparison vis-à-vis competitor products
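The items above can be sketched with a very simple lexicon-based approach: count brand mentions and compare the number of positive and negative words in each post. The brand names, word lists and function name analyze_post are hypothetical. Production systems typically rely on trained classifiers rather than fixed word lists.

```python
import re

# Hypothetical lexicons, for illustration only.
POSITIVE = {"love", "great", "excellent", "good"}
NEGATIVE = {"hate", "bad", "terrible", "poor"}
BRANDS = {"acme", "globex"}

def analyze_post(text):
    """Return the brands mentioned in a post and a coarse sentiment label."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    if positive > negative:
        sentiment = "positive"
    elif negative > positive:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {"brands": sorted(set(tokens) & BRANDS), "sentiment": sentiment}

posts = [
    "I love my new Acme phone, great battery!",
    "Globex support was terrible.",
    "Acme is good but Globex is bad.",
]
for post in posts:
    print(analyze_post(post))
```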