Course

Natural Language Processing


  • Teacher(s)
    Bas Donkers, Meike Morren
  • Research field
    Data Science
  • Dates
    Period 5 - May 06, 2024 to Jul 05, 2024
  • Course type
    Core
  • Program year
    First
  • Credits
    4

Course description

Natural Language Processing (NLP) comprises statistical and machine learning tools for automatically analyzing text data to derive useful insights from it. Vast amounts of information are stored in this form, and hence NLP has become one of the essential technologies of the big data age. In this course, core concepts and techniques from the area will be studied, with a focus on methods that are popular in business applications. These include n-gram models, word vectors, sentiment analysis, word embeddings and topic modelling.
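
To make the bag-of-words and n-gram ideas concrete, here is a minimal sketch in Python; the use of scikit-learn and the toy review texts are illustrative assumptions, not part of the course materials:

    # Minimal bag-of-words / n-gram sketch (assumes scikit-learn is installed).
    from sklearn.feature_extraction.text import CountVectorizer

    # Toy corpus of customer reviews (illustrative only).
    reviews = [
        "Fast delivery and great product",
        "Great product, terrible delivery",
        "Terrible product, never again",
    ]

    # Count unigrams and bigrams; each document becomes a sparse vector of counts.
    vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
    X = vectorizer.fit_transform(reviews)

    print(vectorizer.get_feature_names_out())  # vocabulary of uni- and bigrams
    print(X.toarray())                         # document-term count matrix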

This course offers students a theoretically informed understanding of NLP. It aims to broaden knowledge of the methods involved in NLP and to provide hands-on experience with the steps of an NLP project. We focus on three aspects:

a) to create a deep(er) understanding of the main methods in NLP (n-grams, the lexicon approach, word embeddings and other advanced machine learning methods);
b) to gain experience scraping and cleaning the data yourself (see the sketch after this list);
c) to apply this knowledge and experience in a group assignment that gives you the opportunity to show your creativity.
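
As a minimal illustration of point (b), the sketch below fetches a single page and applies some basic cleaning; the placeholder URL, the choice of requests and BeautifulSoup, and the cleaning steps are illustrative assumptions rather than prescribed course tooling:

    # Minimal scrape-and-clean sketch (assumes requests and beautifulsoup4 are installed).
    import re
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com"  # placeholder URL, illustrative only
    html = requests.get(url, timeout=10).text

    # Strip HTML tags and keep the visible text.
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")

    # Basic cleaning: lowercase, drop non-letters, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()

    tokens = text.split()  # naive whitespace tokenization
    print(tokens[:20])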

By the end of this course, you will be able to analyse and evaluate NLP approaches. Moreover, you will apply this knowledge and these skills in a real-life setting, translating theoretical knowledge into practice.

Topics covered:

  1. Information theory, regular expressions and scraping (tokenization, stemming, lemmatization, parsing).
  2. Word vectors and dimension reduction based on bag of words (n-grams).
  3. Sentiment analysis (lexicon-based vs. model-based).
  4. Word embeddings (Word2Vec, GloVe, BERT); a brief sketch follows this list.
  5. Topic models (LDA).
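
As a pointer for topic 4, the sketch below trains a tiny Word2Vec model with gensim; the toy tokenized sentences and the hyperparameters are illustrative assumptions, and real applications use far larger corpora:

    # Tiny Word2Vec sketch (assumes gensim >= 4.0 is installed).
    from gensim.models import Word2Vec

    # Toy tokenized corpus (illustrative only).
    sentences = [
        ["fast", "delivery", "great", "product"],
        ["great", "product", "terrible", "delivery"],
        ["terrible", "product", "never", "again"],
        ["fast", "shipping", "great", "service"],
    ]

    # Train skip-gram embeddings (sg=1) with a small vector size for the toy data.
    model = Word2Vec(sentences, vector_size=25, window=2, min_count=1, sg=1, epochs=50)

    print(model.wv["product"][:5])           # first entries of one word vector
    print(model.wv.most_similar("product"))  # nearest neighbours in embedding space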

Course literature

The following list of mandatory readings is considered essential for your learning experience. These articles are also part of the exam material. Changes to the reading list will be communicated on CANVAS. Papers marked with ** must be discussed in Feedback Fruits.
Selected papers, per week:

Week 1

· Bowman, S. R. (2023). Eight things to know about large language models. arXiv preprint arXiv:2304.00612. https://arxiv.org/abs/2304.00612

Week 2

· Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.**

· Büschken, J., & Allenby, G. M. (2016). Sentence-based text analysis for customer reviews. Marketing Science, 35(6), 953-975. **

· Hu, M., & Liu, B. (2004, July). Mining opinion features in customer reviews. In AAAI (Vol. 4, No. 4, pp. 755-760).

· Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. Annals of Applied Statistics, 1(1), 17-35.

Week 3

· Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).**

· Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439–453.**

· Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).

· Rong, X. (2014). Word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Week 4

· Shen, D., Wang, G., Wang, W., Min, M. R., Su, Q., Zhang, Y., ... & Carin, L. (2018). Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. arXiv preprint arXiv:1805.09843. **

· Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).**

Week 5

· Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.**

· Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI. **

Week 6

· Sun, C., Qiu, X., Xu, Y., & Huang, X. (2020). How to fine-tune BERT for text classification? arXiv preprint arXiv:1905.05583. https://arxiv.org/pdf/1905.05583.pdf **

· Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). https://doi.org/10.1145/3442188.3445922

· Vig, J. (2019). A multiscale visualization of attention in the Transformer model. https://arxiv.org/pdf/1906.05714.pdf The corresponding blog post is here: https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1

Books:

· Jurafsky, D., & Martin, J. H. (2022). Speech and language processing (3rd ed.). Pearson. Current version available for free here: https://web.stanford.edu/~jurafsky/slp3/