Summary of Text Analytics Module (COMP47600 in UCD)

_1. Pre-processing:_ cleaning up text data for later use. Five typical steps:

–Tokenization & normalization
–Fixing misspellings
–Stemming, lemmatization, POS-tagging
–Entity extraction
–Removing stop words

Tokenization: the process of finding boundaries between word-like entities in a character string. Concretely, it identifies individual words, numbers, and other single coherent constructs. The nltk Python tokenizer works off a set of rules for handling things of different types, including rules that are quite general (e.
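A minimal sketch of three of the steps above (tokenization, normalization, and stop-word removal) in plain Python. The function names, regex, and stop-word list are illustrative assumptions, not taken from nltk or any particular library:

```python
import re

# Illustrative stop-word list (a real one, e.g. nltk's, is much longer).
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}

def tokenize(text):
    """Find boundaries between word-like entities: keep runs of letters/digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(tokens):
    """Lower-case tokens so 'Text' and 'text' count as the same word."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Drop high-frequency function words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

text = "Tokenization is a process to find boundaries in a character string."
tokens = remove_stop_words(normalize(tokenize(text)))
print(tokens)
# → ['tokenization', 'process', 'find', 'boundaries', 'character', 'string']
```

A rule-based tokenizer like nltk's handles many more cases (contractions, punctuation, abbreviations) than this single regex, which is why real pipelines use one rather than a hand-rolled split.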

Basics you need to know in Machine Learning

1. Eager vs Lazy Classifiers

Eager learning classification strategy: the classifier builds a full model during an initial training phase, to use later when new query examples arrive. More offline setup work, less work at run-time. It generalises before seeing the query example.

Lazy learning classification strategy: the classifier keeps all the training examples for later use. Little work is done offline; the work waits until new query examples arrive. It focuses on the local space around the query example.
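The contrast can be sketched with a toy 2-D example: a nearest-centroid classifier as the eager learner (all work up front, queries answered from the model alone) and 1-nearest-neighbour as the lazy learner (no model, all work at query time). The data, labels, and function names below are illustrative assumptions:

```python
import math
from collections import Counter

# Illustrative training set: two well-separated classes in 2-D.
TRAIN = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((8.0, 8.0), "B"), ((9.0, 7.5), "B")]

def dist(p, q):
    return math.dist(p, q)

# Eager: build a full model offline (one centroid per class),
# then generalise before seeing any query.
def train_centroids(data):
    sums, counts = {}, Counter()
    for (x, y), label in data:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] += 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def eager_predict(centroids, query):
    return min(centroids, key=lambda label: dist(centroids[label], query))

# Lazy: keep every training example; at query time, look at the
# local space around the query (here, the single nearest example).
def lazy_predict(data, query):
    return min(data, key=lambda ex: dist(ex[0], query))[1]

model = train_centroids(TRAIN)           # offline work for the eager learner
print(eager_predict(model, (2.0, 1.5)))  # → A
print(lazy_predict(TRAIN, (8.5, 8.0)))   # → B
```

Note where the cost falls: `train_centroids` does all its work before any query exists, while `lazy_predict` scans the full training set on every call, which is exactly the offline/run-time trade-off described above.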