_ 1. Pre-processing: _

is about cleaning up text data for later use. Five typical steps are involved:

Tokenization: tokenization is the process of finding boundaries between word-like entities in a character string. Concretely, it helps to identify individual words, numbers, and other single coherent constructs. The nltk Python tokenizer works off a set of rules for handling things of different types: some rules are quite general (e.g. handling full stops), while others are very specific (e.g. the film title M*A*S*H can be treated as one entity).

Before doing text analytics we typically have a sequence of characters, which we need to break into word-like tokens/entities for later use. After tokenization we can analyse things such as similarity and frequency over each entity. Also, in order to reduce minor differences between tokens, such as accents, capitalisation and acronyms, we need to normalise them; in nltk this usually means applying a "down case" operation to all tokens.
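A minimal sketch of these two steps with nltk is shown below; the sample sentence is illustrative only.

```python
# A minimal sketch of tokenization followed by normalization (down-casing);
# requires the nltk 'punkt' tokenizer data: nltk.download('punkt').
import nltk

text = "The film M*A*S*H was released in 1970."
tokens = nltk.word_tokenize(text)           # split into word-like tokens
normalized = [t.lower() for t in tokens]    # "down case" every token
print(normalized)
```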

Fix misspellings: as we cannot assume all words are spelt correctly, after tokenizing and normalising we can run a spelling checker over the tokens. The standard solution is to take a dictionary and compare each word against it (choosing the dictionary word with the shortest edit distance). The Python package enchant is quite popular for this. The correction process also draws on some probability theory (e.g. Bayes' Theorem), because there is no way to know the intended word for sure (e.g. should 'lates' be corrected to 'late' or 'latest'?); we are trying to choose the most likely spelling correction for the specific word.
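For illustration, a minimal sketch of dictionary-based checking with pyenchant; which suggestion to accept (e.g. for 'lates') is the probabilistic part and is not resolved here.

```python
# A minimal sketch of spell checking with pyenchant against an English dictionary;
# picking the best suggestion would need a probabilistic model on top of this.
import enchant

d = enchant.Dict("en_US")
for token in ["late", "lates", "fishing"]:
    if not d.check(token):                    # token is not in the dictionary
        print(token, "->", d.suggest(token))  # candidate corrections
```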

Stemming, lemmatisation, POS-tagging: in text analytics we need to recognise variations of the same stem or root (e.g. fish- for fishes, fishing or fished). Only when we convert the variations into the root form can we manipulate the word more accurately. Generally, stemming refers to the recovery of the root forms of words to show these commonalities and differences. Technically, stemming is described as the removal of morphological affixes. Several algorithms are used to do this, such as the Porter and Snowball stemmers. The Porter Stemmer is a fairly brutal one: it works off a mixture of specific rules and general ones (e.g. it gets rid of plurals, '-ed' and '-ing').
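A minimal sketch of the Porter stemmer in nltk; the word list is illustrative only.

```python
# A minimal sketch of the Porter stemmer from nltk; affixes are stripped crudely.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["fishes", "fishing", "fished", "meeting"]:
    print(word, "->", stemmer.stem(word))   # e.g. fishes -> fish, meeting -> meet
```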

However, stemmers are quite crude: they have no POS disambiguation and cannot handle irregulars (such as 'am', 'is', 'are'). They also cannot distinguish two words that appear the same but are actually different, for example fish-v and fish-n. We need to recognise when two words have the same form but come from different syntactic categories (or parts of speech, POS): fish the noun versus fish the verb. This is where lemmatisation comes in.

Lemmatisation is the process of grouping together the different inflected forms of a word so they can be analysed as a single item; from another perspective, it is an algorithmic process of determining the lemma of a given word. The difference from a stemmer is that a stemmer operates on a single word without knowledge of context, and therefore cannot discriminate between words that have different meanings depending on POS. Lemmatisation does better than stemming and can deal with irregular forms (man and men).

However, lemmatising may need to know the part of speech (POS) in advance. This is where POS-tagging comes in: the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. The output looks like ("the", "DT"), ("man", "NN"), ..., and can be used in conjunction with lemmatisation to get more accurate word identification.

Overall examples:
1. The word 'better' has 'good' as its lemma; this is missed by stemming. 2. The word 'meeting' can be either the base form of a noun or a form of a verb, e.g. 'in our last meeting' or 'we are meeting again here'. Unlike stemming, lemmatisation can select the appropriate lemma depending on the context.
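A minimal sketch of POS-tagging feeding POS-aware lemmatisation in nltk is shown below; the Treebank-to-WordNet tag mapping is a common simplification rather than the only option.

```python
# A minimal sketch of POS-tagging plus POS-aware lemmatisation with nltk;
# requires nltk data: 'punkt', 'averaged_perceptron_tagger', 'wordnet'.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def wn_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS constant (simplified)."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

tokens = nltk.word_tokenize("we are meeting again here")
for word, tag in nltk.pos_tag(tokens):
    print(word, tag, "->", lemmatizer.lemmatize(word, wn_pos(tag)))
# 'meeting' tagged as a verb lemmatises to 'meet';
# lemmatizer.lemmatize('better', wordnet.ADJ) gives 'good'
```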

Entity extraction: identifying the conceptual entities behind words, which is the most semantic aspect of pre-processing. Here we identify the actual conceptual entity referred to by a word. For example, I.B.M. => IBM (the organisation).

Removing stop words: maximising the proportion of content-bearing words in the document / corpus.

Up to this point we have mainly been transforming words in various ways to make them more useful. But some words do not help, like stop words. They are words that do not convey much content; they are also often very frequent and can distort norms and counts. There is no definitive stop-word set, and it may vary for different purposes, but common examples are "a, an, and, at, be, ...". Removing stop words can produce large reductions (40%-60%) in normal texts, so we can focus on analysing the core content. However, Twitter is different: there, stop-word removal may damage performance.
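A minimal sketch using nltk's English stop-word list; the token list is illustrative only.

```python
# A minimal sketch of stop-word removal with nltk's English list;
# requires the nltk 'stopwords' corpus: nltk.download('stopwords').
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
tokens = ["the", "man", "is", "fishing", "at", "the", "lake"]
content = [t for t in tokens if t not in stop]
print(content)   # ['man', 'fishing', 'lake']
```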

_ 2. Similarity: _

Three broad families of similarity techniques are used in text analytics: featural, transformational and structural similarity.

The one you use depends on the task, on which text-features you are using, and on how you are measuring and weighting those features (e.g. binary, count, tf-idf score).

Featural similarity techniques are by far the most commonly used form of similarity. They can be divided into two broad areas: one for binary presence/absence features (including the Jaccard index and Dice coefficient); the other for frequencies in VSMs (Vector Space Models), including cosine similarity, Euclidean distance and Manhattan distance. Here I will take the Jaccard index as an example. Intuitively, we know that if two things have a certain number of features in common, then they are more similar than two things that have fewer features in common, especially if they do not have many differing features. That is the idea of the Jaccard index: it measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets.

The Jaccard distance measures dissimilarity between sample sets and is obtained by subtracting the Jaccard coefficient from 1.

Compared with the Dice coefficient, the Jaccard distance is a proper distance metric: it satisfies the triangle inequality, i.e. d(i, j) <= d(i, k) + d(k, j), whereas the dissimilarity derived from the Dice coefficient does not.
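A minimal sketch of these binary (presence/absence) measures over token sets; the sets loosely echo the 'stormcock' example discussed below and are illustrative only.

```python
# A minimal sketch of Jaccard index, Jaccard distance and Dice coefficient
# over sets of tokens (binary presence/absence features).
def jaccard(a, b):
    return len(a & b) / len(a | b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

base = {"storm", "cock", "bird", "song"}
target = {"stormcock", "bird", "song"}

print("Jaccard index:   ", jaccard(base, target))       # 2/5 = 0.4
print("Jaccard distance:", 1 - jaccard(base, target))   # 0.6
print("Dice coefficient:", dice(base, target))          # 4/7 ≈ 0.57
```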

Looking at the table above, we find that "stormcock" should be one word in the base text but is divided into two keywords in 'target3'. This tells us that pre-processing is important; otherwise unexpected issues might arise.

For featural similarity in VSMs, such as cosine similarity and Euclidean distance, the typical procedure is:

Step 1. Pre-process the terms in your text-items for the matrix and decide on its structure (rows and columns).

Step 2. Compute the frequency counts of the terms (for TF) and document counts (for IDF) to get TF-IDF.

Step 3. Apply your similarity method to the vectors in your matrix.
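A minimal sketch of these three steps using scikit-learn; the three example documents are illustrative only.

```python
# A minimal sketch of the three steps above with scikit-learn:
# build a TF-IDF matrix, then compute pairwise cosine similarities.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the man is fishing at the lake",
    "a man fished at the lake yesterday",
    "stock prices fell sharply today",
]

vectorizer = TfidfVectorizer(stop_words="english")   # steps 1-2
tfidf = vectorizer.fit_transform(docs)

print(cosine_similarity(tfidf))                      # step 3: pairwise similarities
```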

By doing this we get the distance between two vectors that characterise two things in some n-dimensional space: the greater the distance, the less similar they are to one another; the smaller the distance, the more similar. However, a lot of information is thrown away (a doc is treated as a bag of words, terms are assumed statistically independent, and word order, which is also important, is lost), and long documents are poorly represented because they have poor similarity values. The approach also lacks semantic sensitivity: documents with similar content but different term vocabularies will not be associated.

Featural similarity is based on the intuition that if two things "look" the same, they are the same. But similarity is not only skin-deep; two things may look the same yet have different deep structure.

Transformational similarity is based on the idea that if you can turn one thing into another easily, then they are similar. "Turning one thing into another" means you can apply some procedure/transformation/set of steps to change one thing into the other. Basically, transformational similarity (using Levenshtein / edit distance) is the smallest number of transformations (delete, insert, substitute) required to turn one thing into the other. For example, this technique is used in plagiarism checking and detection, comparing the edit distance against a threshold.
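A minimal sketch of Levenshtein edit distance via dynamic programming; the two example strings are illustrative only.

```python
# A minimal sketch of Levenshtein (edit) distance: the smallest number of
# deletions, insertions and substitutions needed to turn one string into another.
def edit_distance(s, t):
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # substitute
    return dp[m][n]

print(edit_distance("man bites dog", "dog bites man"))
```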

Issues with Edit Distance:

It is only good for small text-items: the O(n²) complexity gets expensive with longer strings. It is generally a less commonly used type of similarity, but very powerful in some cases (string distance, tweet differences, plagiarism).

As mentioned before, featural approaches and various bag-of-words approaches throw away a lot. In language, this might be the syntactic structure of the sentence:
"man bites dog" versus "dog bites man". This type of structural similarity can be glossed as a special case of transformational similarity. The problem is that you often need to buy into more complex representations, which caused it to fall out of favour in the face of the massive success of VSM approaches. One technique commonly used in structural similarity is analogy, but it is of limited use in text analytics, where everything is dominated by vector-space models.

_ 3. Supervised & Unsupervised learning: _

In supervised learning, the class labels are known, and the experience provided to the algorithm is the relationship between actual entities and the group they belong to. In unsupervised learning, however, the class labels of the data are unknown, and the learning algorithm itself has to figure out patterns in the data provided.

Supervised learning can be further grouped into regression and classification problems. In text analytics, classification is used to identify an item as a member of a class and/or to make predictions about that item, using algorithms such as k nearest neighbours, naïve Bayes, logistic regression and support vector machines. It is used to determine, for example, whether an article is about sport or business, or whether an email is spam or not. K nearest neighbours (k-NN) is one of the representative classification algorithms. It requires us to pre-specify a value of k. Given a training set D, the algorithm computes the distance (or similarity) between the test object and all objects in D to determine its nearest-neighbour list (of length k), then applies majority voting over that list to determine the class label of the test object. In addition, rather than simple voting of neighbours for one class, you can also weight neighbours by their distance from the test object (e.g. cosine similarity/Euclidean distance over some tf-idf vector). So, if out of 10 neighbours the 5 from class A are all closer than the 5 from class B, weighting the vote by distance will give a better classification.
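A minimal sketch of distance-weighted k-NN over tf-idf vectors with scikit-learn; the toy documents and labels are illustrative only.

```python
# A minimal sketch of distance-weighted k-NN text classification over tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = [
    "the team won the football match",
    "a thrilling final set at the tennis open",
    "shares rose after the earnings report",
    "the central bank raised interest rates",
]
train_labels = ["sport", "sport", "business", "business"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# weights="distance" makes closer neighbours count for more in the vote
knn = KNeighborsClassifier(n_neighbors=3, weights="distance", metric="cosine")
knn.fit(X_train, train_labels)

X_test = vectorizer.transform(["the team won the tennis final"])
print(knn.predict(X_test))   # expected: ['sport']
```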

In the literature, Yang (1999) evaluated different kinds of statistical approach to text classification, a classic use in text processing. The main idea of the work was to apply different classifiers to take docs, classify them into news categories (sport, current affairs) or medical categories (vascular, psychiatric), and evaluate the performances. First of all, aggressive pre-processing was applied: stop-word removal, stemming and word weighting, achieving an 85% vocabulary reduction; docs were then cast as tf-idf vectors and different classifiers were used. A large-scale test across several corpora and techniques showed k-NN to do surprisingly well.

Clustering is an example of unsupervised learning, in which data are grouped into clusters of closely related items. Clustering can be further grouped into k-means (partitional clustering), hierarchical clustering and graph-based clustering. K-means is a representative technique; it works as follows:

  1. Randomly place k points in the m-dimensional space.

  2. Compute the distance of every instance from these points and assign each instance to the group of its closest point.

  3. Move the points to the centre of their groups.

  4. Repeat until the points no longer move.

As k-means requires us to supply initial points, the result of each run may differ, which is also why it may not find the global optimum. In practice, we often run many iterations with 2 points, 3 points, etc., and look for the most persistent grouping that emerges. In addition, k-means is not suitable for clusters with non-convex shapes, and because it computes Euclidean distance it is sensitive to noisy data and outliers.
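A minimal sketch of k-means over tf-idf vectors with scikit-learn; n_init re-runs the algorithm from several random initialisations, reflecting the point above about differing runs, and the documents are illustrative only.

```python
# A minimal sketch of k-means clustering of documents represented as tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock prices fell after the earnings report",
    "the market rallied on interest rate news",
    "the striker scored twice in the final",
    "a dramatic penalty shootout decided the match",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# n_init=10: try ten random initialisations and keep the best clustering
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster index assigned to each document
```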

In the literature, Feng and colleagues (2002) used k-means for news-sensitive stock trend prediction. They gathered articles from periods when stock prices were rising or falling and clustered them using k-means; they then computed similarities between news-article groups to relate each group to a rise/fall trend. They found a strong relationship between the frequency of news announced and the resulting profit, and reported that this model can be applied to predict upcoming stock trends.

_ 4. Evaluation: _

The evaluation process in a text analytics system tries to verify a trained classifier on unseen data, and to compare the performance of two or more classifiers on similar problems. For example, we evaluate whether the system gets the right answer (such as assessing whether it finds all the topics in the corpus); when analysing tweets, whether the system produces a good summary; or, when doing sentiment analysis, whether the system correctly determines whether a word is positive, negative or neither.

Ground Truth (GT): ask people for the right answer(s), i.e. which items are correct/relevant. For example, in the past this was developed to test indexing schemes and book search in physical libraries. The judgements were made by experts: people took a set of books, said they wanted to find, say, a French dictionary, found candidate books, and got experts to judge whether these were the ones wanted. However, as IT systems for IR, topic detection and so on have exploded in scale, every aspect of this method has become larger, and it is now very hard for experts to look at all the items and make judgements. People rarely build full ground truth now; instead, two main innovations, pooling and crowdsourcing, are widely used.

Precision: the fraction of output items that are correct versus incorrect, using GT; in other words, the proportion of retrieved results that are relevant. In the evaluation process we construct a confusion matrix of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN):

Precision = TP / (TP + FP). As a real-world example, take a search engine: given a collection of 100k documents, we want to find all documents on a specific topic. In fact, 50 relevant documents actually exist. We perform a search, and 9 out of the 10 results on the first page are relevant documents.

So here the precision of the system is 9/10 = 90%, which means 90% of the retrieved results are relevant.

Recall: the fraction of correct items in the GT that appear in the output; in other words, the proportion of relevant results that are retrieved.

Similarly, in the evaluation process we construct a confusion matrix, and recall = TP / (TP + FN), which is also called the true positive rate. Taking the same example as for precision: we performed a search and got 9 relevant documents among the 10 results on the first page. In fact, 50 relevant documents actually exist, so recall = 9/50 = 18%, which means 18% of the relevant results were retrieved.

F-measure: one number for precision & recall. It measures a combination of precision and recall, and is also called the weighted **harmonic mean** of precision and recall. F1 gives equal weighting to both; F2 is sometimes used to weight recall more, and F0.5 to weight precision more than recall. F1 is a widely used measure, where F1 = 2 * Precision * Recall / (Precision + Recall). In the example above we calculated precision and recall, so the F1 of this system is F1 = 2 * 0.9 * 0.18 / (0.9 + 0.18) = 0.3. As the F-measure summarises the trade-off between precision and recall, F1 = 0.3 here indicates that the system may not be as good as we expected.
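The worked example above as a minimal sketch (10 results retrieved, 9 of them relevant, 50 relevant documents in the collection):

```python
# A minimal sketch of the precision/recall/F1 calculation for the example above.
tp, fp, fn = 9, 1, 41   # 9 relevant retrieved, 1 irrelevant retrieved, 41 relevant missed

precision = tp / (tp + fp)                          # 0.90
recall = tp / (tp + fn)                             # 0.18
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, round(f1, 2))              # 0.9 0.18 0.3
```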

_ 5. Sentiment Analysis: _

Human Ratings: of sentiment terms/phrases/documents. Get experts or crowd-sourced non-experts to judge texts/phrases/words as positive or negative, then extrapolate (in some way) from these ratings to some wider set of matching terms. They are rarely used directly, but support ground truth for other tasks. For example, in the SemEval-2007 Affective Text Task, 6 expert raters annotated news headlines with six emotions (Anger, Disgust, Fear, Joy, Sadness, Surprise) and valence (+ or -); these expert ratings were used to determine the best system at the Affective Text Task. Statistics for inter-rater agreement, such as Pearson correlation and Cohen's Kappa, can be applied to evaluate how you determine the correct answer from human ratings; they systematically detect bias between groups of judges.
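A minimal sketch of inter-rater agreement using Cohen's Kappa from scikit-learn; the two raters' positive/negative judgements are illustrative only.

```python
# A minimal sketch of Cohen's Kappa between two raters' sentiment judgements.
from sklearn.metrics import cohen_kappa_score

rater_a = ["pos", "neg", "pos", "pos", "neg", "neg"]
rater_b = ["pos", "neg", "pos", "neg", "neg", "neg"]

# kappa lies in [-1, 1]: 1 = perfect agreement, 0 = chance-level agreement
print(cohen_kappa_score(rater_a, rater_b))
```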

In fact, human ratings are seldom used directly (only really for ground truth), because we could not realistically rate all the items in a million-item corpus or in a Tweet stream. But they have been used to identify docs/phrases/features for sentiment classifiers. Also, services like Amazon's Mechanical Turk (AMT) Human Intelligence Tasks (HITs) have raised the possibility of using non-expert ratings.

In conclusion, human ratings are not practically useful for large-scale, regular sentiment identification; but expert and non-expert ratings can form the basis for other systems; indeed, they lie behind sentiment lexicons and classifier solutions. Methods for rater reliability are also generally important and useful.

Sentiment Lexicons: matching text bits to sentiment word lists and lexicons (built from human judgements). The idea is to develop lists of sentiment words, i.e. to build a lexicon. Since the 1960s, the social sciences have built such lexicons. For example, the Harvard List proved reasonable at identifying positive/negative texts, but is still fairly minimal; the MPQA Corpus, a _subjectivity lexicon_, divides words into strong-subj and weak-subj and classifies them with polarity; and WordNet, a high-quality lexical resource for nouns, verbs and adjectives, does not define the meaning of words but gives their POS and synsets.

In conclusion, the use of sentiment lexicons plugs sentiment analysis into very large lexicons; they are used by many systems and have solved a lot of problems. However, there are also problems, such as the unreliability of human ratings and/or automated extensions of the lists, and inherently positive or negative terms being modified in context (e.g. 'it's not a bad film').

Sentiment Classifiers: classifying texts to find sentiment features (built from human ratings). There are two ways to use classifiers: one is supervised, training a classifier on known positive or negative docs and extracting pos/neg features from it; the other is unsupervised, using probabilistic word similarity (e.g. PMI, Pointwise Mutual Information) to known positive/negative words to classify a doc as positive/negative.
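A minimal sketch of the supervised route using a Naive Bayes classifier over bag-of-words counts; the tiny labelled training set is illustrative only.

```python
# A minimal sketch of supervised sentiment classification with Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["a wonderful, moving film", "great acting and a great story",
        "a dull, boring mess", "not worth watching, terrible plot"]
labels = ["pos", "pos", "neg", "neg"]

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(docs), labels)

print(clf.predict(vectorizer.transform(["a boring, terrible story"])))  # expected: ['neg']
```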

This is often the preferred solution for sentiment analysis. But there are also issues (e.g. learning takes a long time; the features that have been learned sometimes seem weird; and generalisation to other areas is often poor).

_ Add. Frequency: _

Weighting Words: TF-IDF

Finding Discriminating Words: LLRs

Co-occurrence & Collocation: (Pointwise) Mutual Information (PMI)

Redundancy & Interestingness: Entropy (normalised and un-normalised)

Term Frequency (TF) refers to the count of a term (word) in a given document (text-item). On its own it is still fairly crude, even using different options. The cleverness of TF-IDF is that the IDF part weights the frequency by rarity: a term that is really frequent in a given document but really rare in every other document is not equal to a term that is really frequent in every document (e.g. a term that occurs in every text-item carries no weight). Document Frequency (DF) refers to the count of documents (text-items) in a set in which a term (word) occurs; Inverse Document Frequency (IDF) uses the size of the document set and the term's document frequency.

TF-IDF modifies the raw frequency by the overall rarity of the term in the corpus of documents. It does not distinguish repeated-but-important terms (tea) from repeated-but-not-important terms (the, and); hence stop-word removal can become important.
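A minimal sketch of the classic weighting tf(t, d) * log(N / df(t)); library implementations (e.g. scikit-learn) use slightly smoothed variants, and the toy corpus is illustrative only.

```python
# A minimal sketch of tf-idf: term frequency weighted by log inverse document frequency.
import math

docs = [
    ["the", "tea", "tea", "is", "hot"],
    ["the", "coffee", "is", "hot"],
    ["the", "lake", "is", "cold"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                    # raw term frequency in this doc
    df = sum(1 for d in docs if term in d)  # number of docs containing the term
    return tf * math.log(len(docs) / df)

print(tf_idf("tea", docs[0], docs))   # high: frequent here, rare elsewhere
print(tf_idf("the", docs[0], docs))   # 0.0: occurs in every document
```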

A likelihood ratio test is a statistical test used to compare the fit of two models, one of which (the null model) is a special case of the other (the alternative model). It helps to find words that stick out (are rare overall) but indicate something about a text-item. Chi-square is used to investigate whether distributions of categorical variables differ from one another.
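A minimal sketch of a log-likelihood ratio (G-test) on a 2x2 contingency table, comparing a word's frequency in a target text-item against a background corpus; the counts are illustrative only and scipy's chi2_contingency is used with its log-likelihood option.

```python
# A minimal sketch of a log-likelihood ratio (G-test) for a word that "sticks out".
from scipy.stats import chi2_contingency

#            this word   all other words
table = [[30,   970],      # target text-item
         [50, 99950]]      # background corpus

g, p, dof, expected = chi2_contingency(table, lambda_="log-likelihood")
print(g, p)   # a large G and a small p suggest the word is exceptional in the target
```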

We have looked at counting individual items (words, word-pairs, emoticons), but we often want to find out whether certain words co-occur together. An association measure of two items considers how likely they are to be found together or apart in some larger set of items (words, etc.); it tells us how informative the occurrence of one word is about the occurrence of another. Words that are highly informative about one another form a collocation. Mutual Information (MI) and Pointwise Mutual Information (PMI) use the observed frequency of the pair and an expected frequency of the pair under the assumption that the parts are independent.
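A minimal sketch of ranking bigrams by PMI with nltk's collocation tools; the tiny word list stands in for a tokenised corpus.

```python
# A minimal sketch of finding collocations (word pairs that "go together") by PMI.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = ("strong tea is better than weak tea while strong coffee "
         "beats weak coffee").split()

finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(BigramAssocMeasures.pmi, 5))   # top five bigrams by PMI
```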

Beyond word frequencies, we can examine the distribution of those frequencies to see whether a text-item (or corpus of text-items) is redundant or interesting. If a text-item has many words with the same frequencies (a flat distribution), then it is likely to be about a single topic and quite repetitive. Entropy is defined as the negative sum of each label's probability times the log probability of that same label, H = -Σ p(x) log p(x).
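A minimal sketch of (optionally normalised) entropy over a word-frequency distribution; the token lists are illustrative only.

```python
# A minimal sketch of entropy, H = -sum(p * log2 p), over word frequencies;
# normalising by log2 of the vocabulary size scales it to [0, 1].
import math
from collections import Counter

def entropy(tokens, normalise=False):
    counts = Counter(tokens)
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    if normalise and len(counts) > 1:
        h /= math.log2(len(counts))
    return h

print(entropy(["tea", "tea", "tea", "tea"]))                     # 0.0: totally redundant
print(entropy(["tea", "cup", "milk", "sugar"], normalise=True))  # 1.0: maximally varied
```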

In conclusion, TF-IDF weights word frequencies in a corpus; LLR helps us find exceptional terms/words; PMI allows us to find words that "go together"; and entropy tells us about redundancy or randomness/interestingness in a text-item.