Natural Language Processing
- Teacher(s): Bas Donkers, Meike Morren
- Research field: Data Science
- Dates: Period 5 - May 02, 2022 to Jul 15, 2022
- Course type: Core
- Program year: First
- Credits: 4
Course description
Natural Language Processing (NLP) comprises statistical and machine learning tools for automatically analyzing text data to derive useful insights from it. Vast amounts of information are stored in this form, and hence NLP has become one of the essential technologies of the big data age. In this course, core concepts and techniques from the area will be studied, with a focus on methods that are popular in business applications. These include n-gram models, word vectors, sentiment analysis, word embeddings and topic modelling.
This course offers students a theoretically informed understanding of NLP. It aims to broaden your knowledge of the methods involved in NLP and to give you hands-on experience with the steps that need to be taken in an NLP project. We focus on three aspects:
a) to create a deep(er) understanding of the main methods in NLP (n-gram models, the lexicon approach, word embeddings, and other advanced machine learning methods);
b) to gain hands-on experience scraping and cleaning text data yourself;
c) to apply this knowledge and experience in a group assignment that gives you the opportunity to show your creativity.
By the end of this course, you will be able to analyse and evaluate NLP approaches. Moreover, you will apply this knowledge and these skills in a real-life setting, translating theoretical knowledge into practice.
Topics covered (each topic is illustrated with a short code sketch after the list):
- Information theory, regular expressions and scraping (tokenization, stemming, lemmatization, parsing).
- Word vectors and dimension reduction based on bag of words (n-grams)
- Sentiment analysis (lexicon-based vs model-based)
- Word embeddings (Word2Vec, GloVe, BERT)
- Topic models (LDA)
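To make the first topic concrete, here is a minimal preprocessing sketch in Python, assuming the NLTK library (a common choice; the course does not prescribe it) and an invented example sentence:

```python
# A minimal sketch, assuming nltk is installed and its 'punkt' and
# 'wordnet' resources are available for download.
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)    # tokenizer models
nltk.download("wordnet", quiet=True)  # lemmatizer dictionary

raw = "The reviewers LOVED these products!! See https://example.com."

# Regular expressions: strip the URL, then lower-case the text.
clean = re.sub(r"https?://\S+", "", raw).lower()

tokens = nltk.word_tokenize(clean)                           # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]            # loved -> love
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # products -> product

print(tokens)
print(stems)
print(lemmas)
```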
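The bag-of-words representation behind the second topic can be sketched with scikit-learn (again an assumed library choice); ngram_range=(1, 2) produces both unigrams and bigrams:

```python
# Build a document-term count matrix over uni- and bigrams;
# the two toy documents are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the battery life is great",
    "the battery died after a day",
]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # vocabulary of uni- and bigrams
print(X.toarray())                          # raw counts per document
```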
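The lexicon-based vs model-based contrast in sentiment analysis can be sketched as follows, using NLTK's VADER lexicon and a scikit-learn logistic regression; the two reviews and their labels are invented:

```python
# Lexicon-based: words carry pre-scored polarities, no training needed.
# Model-based: a classifier learns word weights from labelled examples.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

nltk.download("vader_lexicon", quiet=True)

reviews = ["great product, works perfectly", "terrible, broke after one day"]
labels = [1, 0]  # hypothetical labels: 1 = positive, 0 = negative

sia = SentimentIntensityAnalyzer()
print([sia.polarity_scores(r)["compound"] for r in reviews])  # lexicon scores

X = CountVectorizer().fit_transform(reviews)
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))  # model-based predictions on the training texts
```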
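For the fourth topic, a toy skip-gram Word2Vec model can be trained with gensim (an assumed library; real embeddings require far more text than these invented sentences):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "battery", "life", "is", "great"],
    ["the", "screen", "quality", "is", "great"],
    ["the", "battery", "died", "quickly"],
]

# Skip-gram (sg=1), as in Mikolov et al. (2013); tiny dimensions for the demo.
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, sg=1, seed=1)

print(model.wv["battery"][:5])                   # first 5 embedding dimensions
print(model.wv.most_similar("battery", topn=2))  # nearest words by cosine
```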
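Finally, an LDA topic model in the spirit of Blei et al. (2003) can be sketched with scikit-learn; the four documents and the choice of two topics are invented:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "battery charger power battery",
    "screen display pixels screen",
    "power battery charger cable",
    "display screen resolution pixels",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # document-term count matrix

# Two latent topics; LDA infers a word distribution per topic
# and a topic mixture per document.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:3]  # three highest-weight words per topic
    print(f"topic {k}:", [terms[i] for i in top])

print(lda.transform(X).round(2))  # per-document topic proportions
```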
Course literature
- The following list of mandatory readings (presented in alphabetical order) is considered essential for your learning experience. These articles are also part of the exam material. Changes to the reading list will be communicated on CANVAS. Papers marked with ** must be discussed in Feedback Fruits.
Selected papers, per week:
Week 2
- Hu, M., & Liu, B. (2004, July). Mining opinion features in customer reviews. In AAAI (Vol. 4, No. 4, pp. 755-760).**
- Pang, B., Lee, L., & Vaithyanathan, S. (2002, July). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10 (pp. 79-86). Association for Computational Linguistics.**
Week 3
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.**
- Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. Annals of Applied Statistics, 1(1), 17-35.
- Büschken, J., & Allenby, G. M. (2016). Sentence-based text analysis for customer reviews. Marketing Science, 35(6), 953-975.**
Week 4
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).**
- Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).**
- Rong, X. (2014). Word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.
- Shi, T., & Liu, Z. (2014). Linking GloVe with word2vec. arXiv preprint arXiv:1411.5595.
Week 5
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).**
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.**
- Vig, J. (2019). A multiscale visualization of attention in the Transformer model. https://arxiv.org/pdf/1906.05714.pdf
Week 6
- Shen, D., Wang, G., Wang, W., Min, M. R., Su, Q., Zhang, Y., ... & Carin, L. (2018). Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. arXiv preprint arXiv:1805.09843.**
- Sun, C., Qiu, X., Xu, Y., & Huang, X. (2020). How to fine-tune BERT for text classification? https://arxiv.org/pdf/1905.05583.pdf **
Books:
- Jurafsky, D., & Martin, J. H. (2014). Speech and language processing (Vol. 3). London: Pearson.
- Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT Press.