Summary of the Text Analytics Module (COMP47600 at UCD)
_1. Pre-processing_ is about cleaning up text data for later use. Five typical steps:

– Tokenization & normalization
– Fixing misspellings
– Stemming, lemmatization, POS-tagging
– Entity extraction
– Removing stop words

Tokenization: tokenization is the process of finding boundaries between word-like entities in a character string. Concretely, it identifies individual words, numbers, and other single coherent constructs. The nltk Python tokenizer works off a set of rules for handling tokens of different types, some of which are quite general (e.g. …). A minimal sketch of the first step is shown below.
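As a rough illustration (not taken from the module notes), the sketch below uses nltk's `word_tokenize` for the tokenization step and plain lowercasing as a simple normalization; the sample sentence and the lowercasing choice are illustrative assumptions.

```python
# Minimal sketch: tokenization & normalization with NLTK.
# Assumes nltk is installed; the "punkt" models are a one-time download.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize

text = "Good muffins cost $3.88 in New York. Please buy me two of them."
tokens = word_tokenize(text)
print(tokens)
# ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
#  'Please', 'buy', 'me', 'two', 'of', 'them', '.']

# Simple normalization: lowercase each token so 'New' and 'new' compare equal.
normalized = [t.lower() for t in tokens]
```

Note how the rule-based tokenizer keeps "$" and "3.88" as separate, coherent tokens rather than splitting naively on whitespace, which is the kind of type-specific handling the notes refer to.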