Summary of the Text Analytics Module (COMP47600 at UCD)
_1. Pre-processing_ is about cleaning up text data for later use. Five typical steps:

– Tokenization & normalization
– Fixing misspellings
– Stemming, lemmatization, POS-tagging
– Entity extraction
– Removing stop words

Tokenization: tokenization is the process of finding boundaries between word-like entities in a character string. Concretely, it identifies individual words, numbers, and other single coherent constructs. The nltk Python tokenizer works off a set of rules for handling tokens of different types, some of which are quite general (e.g. …). A minimal sketch of the first step is shown below.
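As a rough illustration (not taken from the module notes), the sketch below uses nltk's `word_tokenize` for the tokenization step and plain lowercasing as a simple normalization; the sample sentence and the lowercasing choice are illustrative assumptions.

```python
# Minimal sketch: tokenization & normalization with NLTK.
# Assumes nltk is installed; the "punkt" models are a one-time download.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize

text = "Good muffins cost $3.88 in New York. Please buy me two of them."
tokens = word_tokenize(text)
print(tokens)
# ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
#  'Please', 'buy', 'me', 'two', 'of', 'them', '.']

# Simple normalization: lowercase each token so 'New' and 'new' compare equal.
normalized = [t.lower() for t in tokens]
```

Note how the rule-based tokenizer keeps "$" and "3.88" as separate, coherent tokens rather than splitting naively on whitespace, which is the kind of type-specific handling the notes refer to.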