Overview: Text analytics is a powerful mechanism for extracting structured data from unstructured or semi-structured text. This is done by creating rules, which extraction programs then use to pull out the relevant information.
In this article we will discuss the Annotation Query Language (AQL), which is used for text analytics.
Introduction: IBM InfoSphere BigInsights is a platform for analyzing the business insights hidden within huge volumes of highly diverse data. Such data is usually ignored because it is almost impossible to process at this volume using traditional DBMS or RDBMS tools. Annotation Query Language (AQL) is the query language used in InfoSphere BigInsights to build extractors that can extract structured information from unstructured or semi-structured content.
Components of Text Analytics:
- Input collection formats – An input collection is a document or a set of documents that serves as the input text from which we extract information. An input collection must usually be in one of the following formats –
- A UTF-8 encoded text file with any of the following extensions –
- .txt
- .htm or .html or .xhtml
- .xml
- A directory containing UTF-8 encoded text files.
- An archive file containing UTF-8 encoded text files, with one of the following extensions –
- .tar
- .zip
- .gz
- A UTF-8 encoded comma-separated (CSV) file.
- A plain JSON file.
- Regular expressions – Regular expressions are most commonly used as a text search mechanism. We can use regular expression builders to construct expressions and sub-expressions.
- Multilingual support – The text analytics components support most of the common languages used for written communication. Language processing is based on two major techniques – tokenization and part-of-speech analysis.
- Patterns – The pattern discovery feature groups input contexts that are similar or share a common pattern.
- Annotation Query Language (AQL) – AQL is the primary language used for text analytics. It is used to build extractors that extract relevant information from unstructured textual content. Its syntax is deliberately similar to SQL; a small sketch follows this list.
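To give a feel for the SQL-like syntax, here is a minimal sketch of an AQL extractor that uses a regular expression. The module name, view name and regular expression are illustrative assumptions, not taken from a real application.

```
module regexDemo;  -- hypothetical module name

-- Extract US-style phone numbers from each input document.
-- The regular expression below is purely illustrative.
create view PhoneNumber as
  extract regex /\d{3}-\d{3}-\d{4}/
    on D.text as number
  from Document D;

-- Only views declared as output are returned by the extractor.
output view PhoneNumber;
```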
Aspects of Text Analytics:
- Declarative language – A declarative language is used to identify and extract textual information from existing text content. AQL enables us to define our own collections of records, called views, that match a specified rule. These views are the main output of any AQL extractor and are used to display reports in IBM BigSheets, the built-in reporting and dashboard component of the InfoSphere BigInsights platform.
- User-defined dictionaries – A dictionary identifies specific terms within the input text so that business insights can be extracted from it. In AQL we can create our own customized dictionaries, which helps us obtain the desired results efficiently.
- User-defined rules – With the help of patterns and regular expressions we can specify rules that segregate the relevant data from a large data set.
Let us consider the following example – we can specify certain keywords that may or may not appear within a given range of one another, e.g. the three words “Apple”, “Mac” and “Steve”. If all these words appear within a defined range, it becomes fairly clear that we are talking about Apple computers, founded by Steve Jobs, with Mac as the operating system. But if the word “Waugh” appears right after the word “Steve” and the other two keywords – “Apple” and “Mac” – are absent, then it becomes clear that we are talking about the famous Australian cricketer Steve Waugh. A dictionary-and-pattern sketch of this kind of rule follows this list.
- Tracking – Text analysis is an iterative process. It becomes necessary to modify the rules and the user-defined dictionaries based on the results that the existing rules produce.
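The following sketch shows how the “Apple / Mac / Steve” proximity rule described above could be expressed with user-defined dictionaries and an AQL pattern. The dictionary entries, the token distance and all names are assumptions made for illustration.

```
module proximityDemo;  -- hypothetical module name

-- Illustrative dictionaries for the proximity rule.
create dictionary AppleTermDict as ('Apple', 'Mac');
create dictionary SteveDict as ('Steve');

create view AppleTerm as
  extract dictionary 'AppleTermDict'
    on D.text as term
  from Document D;

create view SteveMention as
  extract dictionary 'SteveDict'
    on D.text as name
  from Document D;

-- Match spans where "Steve" is followed by an Apple-related term
-- within at most 10 tokens (an assumed range).
create view AppleContext as
  extract pattern <S.name> <Token>{0,10} <A.term>
    as snippet
  from SteveMention S, AppleTerm A;

output view AppleContext;
```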
Text Analytics Process:
The text analytics process is carried out in the following four steps –
- Step 1 – Collect and prepare sample data – Any text analytics application is developed with the help of sample data. This sample data is a subset of the larger data set that we have collected. Depending on the format of the input data, we need to prepare it in one or more of the formats supported by BigInsights. In the example mentioned above we look for the input keywords “Apple”, “Mac” and “Steve”; these input parameters help the application gather data from websites where these keywords appear.
- Step 2 – Develop and test the text extractor – BigInsights plugins are available for the most commonly used Java IDE, Eclipse. Using the Eclipse-based wizards we can easily develop text extractors and test them. The BigInsights information center has all the information on the prerequisite software required to develop text extractors. At a broad level, the following steps need to be carried out to create a text extractor in Eclipse once the BigInsights plugin is installed successfully –
- Create a new BigInsights project.
- Import the sample data required for testing. The sample data in our example is typically in JSON array format. For testing purposes we can use the BigSheets export facility to export some records (around 10,000) to a CSV file, and then run a Jaql script that converts the CSV file into a delimited file format readable by BigInsights. This new file is then used as the input file for the Eclipse analysis tooling.
- Create the artifacts required by the application, e.g. AQL modules, AQL scripts, user-defined dictionaries and so on (a small module sketch follows this step list).
- Test the code against the sample documents in the input collection provided. Built-in features such as the Annotation Explorer and the log pane are used to inspect the results. This testing should be carried out iteratively.
- Step 3 – Publish and deploy – The application is ready to be published and deployed when we are satisfied with the results produced by the text extractor. It is usually published in the application catalog of a cluster. To deploy the published application we use the BigInsights web console, logging in with an ID that has administrative privileges.
- Step 4 – Run the text extractor – After deploying the text extractor successfully, it is time to execute it. As we know, BigInsights can invoke text extractors using its Java API, with the help of Jaql and BigSheets. The advantage of using BigSheets is that no additional coding or scripting is required, so any business analyst can take up this task.
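To illustrate the kind of AQL artifacts created in Step 2, here is a rough sketch of a module that loads a dictionary from a file and exports a view for reuse by other modules. The module name, dictionary file path and view names are assumptions for illustration only.

```
module artifactsDemo;  -- hypothetical module name

-- Dictionary loaded from a file packaged with the module
-- (the path is an assumed example).
create dictionary CompanyNameDict
  from file 'dictionaries/company_names.dict';

create view CompanyMention as
  extract dictionary 'CompanyNameDict'
    on D.text as company
  from Document D;

-- Exporting a view makes it available to other AQL modules that import it;
-- outputting it makes it part of the extractor's results.
export view CompanyMention;
output view CompanyMention;
```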
AQL Views:
There is nothing special about AQL views: they are similar to the standard views in a relational database. Each AQL view has a name and consists of rows and columns, and all AQL statements operate on views. Unlike relational views, an AQL view is not materialized unless it is declared as an output view or used by another view. One special view, called Document, is predefined; at runtime it is mapped to one input document at a time from your collection. This view is very helpful for extracting a relevant subset from a large set of data.
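As a concrete illustration of views built on top of the Document view, here is a small sketch in which one view is derived from another with a SQL-like select; the module name, dictionary entries and view names are illustrative assumptions.

```
module viewsDemo;  -- hypothetical module name

create dictionary ProductDict as ('Mac', 'iPhone');  -- assumed entries

-- Every extraction ultimately starts from the built-in Document view,
-- whose text column holds the current input document.
create view ProductMention as
  extract dictionary 'ProductDict'
    on D.text as product
  from Document D;

-- Views can be layered on top of other views with SQL-like select statements.
create view ProductText as
  select P.product as product, GetText(P.product) as productText
  from ProductMention P;

-- Only output views are materialized and returned by the extractor.
output view ProductText;
```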
Summary: Text analytics is at the heart of any analytics application, so it is very important to learn the tools and frameworks required to develop text analytics applications. IBM InfoSphere BigInsights is one of the best tools available for text analytics. Let us summarize our discussion in the following bullets –
- Text analytics is a powerful mechanism used to extract information from unstructured data.
- Major components of text analytics are –
- Input collection formats
- Regular expressions
- Multilingual support
- Annotation Query Language (AQL)
- Major aspects of text analytics are –
- Declarative language
- User-defined dictionaries
- User-defined rules
- Tracking