Cs 224 N 2019

winter-2019

课程资料：

学习笔记参考：

CS224n-2019 学习笔记

斯坦福CS224N深度学习自然语言处理2019冬学习笔记目录

参考书：

Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft)
Jacob Eisenstein. Natural Language Processing
Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning

神经网络相关的基础:

Michael A. Nielsen. Neural Networks and Deep Learning
Eugene Charniak. Introduction to Deep Learning

Lecture 01: Introduction and Word Vectors

The course (10 mins)
Human language and word meaning (15 mins)
Word2vec introduction (15 mins)
Word2vec objective function gradients (25 mins)
Optimization basics (5 mins)
Looking at word vectors (10 mins or less)

课件

cs224n-2019-lecture01-wordvecs1
- WordNet, 一个包含同义词集和上位词(“is a”关系) synonym sets and hypernyms 的列表的辞典
- 在传统的自然语言处理中，我们把词语看作离散的符号，单词通过one-hot向量表示
- 在Distributional semantics中，一个单词的意思是由经常出现在该单词附近的词(上下文)给出的，单词通过一个向量表示，称为word embeddings或者word representations，它们是分布式表示(distributed representation)
- Word2vec的思想
cs224n-2019-notes01-wordvecs1
- Natural Language Processing.
- Word Vectors.
- Singular Value Decomposition(SVD). (对共现计数矩阵进行SVD分解，得到词向量)
- Word2Vec.
- Skip-gram. (根据中心词预测上下文)
- Continuous Bag of Words(CBOW). (根据上下文预测中心词)
- Negative Sampling.
- Hierarchical Softmax.

Suggested Readings

Word2Vec Tutorial - The Skip-Gram Model (该博客分为2个部分，skipgram思想，以及改进训练方法：下采样和负采样)
理解 Word2Vec 之 Skip-Gram 模型(上述文章的翻译)
Applying word2vec to Recommenders and Advertising (word2vec用于推荐和广告)
Efficient Estimation of Word Representations in Vector Space (original word2vec paper)(没太看懂，之后再看一遍)
Distributed Representations of Words and Phrases and their Compositionality (negative sampling paper)

参考阅读

[NLP] 秒懂词向量Word2vec的本质(推荐了一些很好的资料)
word2vec Parameter Learning Explained
基于神经网络的词和文档语义向量表示方法研究
word2vec中的数学原理详解
网易有道word2vec(词向量相关模型，word2vec部分代码解析与tricks)

Assignment 1：Exploring Word Vectors

[code] [preview]

Count-Based Word Vectors(共现矩阵的搭建, SVD降维, 可视化展示)
Prediction-Based Word Vectors(Word2Vec, 与SVD的对比, 使用gensim, 同义词,反义词,类比,Bias)

笔记整理

word2vec的思想、算法步骤分解、代码

Lecture 02: Word Vectors 2 and Word Senses

Finish looking at word vectors and word2vec (12 mins)
Optimization basics (8 mins)
Can we capture this essence more effectively by counting? (15m)
The GloVe model of word vectors (10 min)
Evaluating word vectors (15 mins)
Word senses (5 mins)

课件

Gensim word vector visualization[code] [preview]
cs224n-2019-lecture02-wordvecs2
- 复习word2vec(一个单词的向量是一行；得到的概率分布不区分上下文的相对位置；每个词和and, of等词共同出现的概率都很高)
- optimization: 梯度下降，随机梯度下降SGD，mini-batch(32或64,减少噪声，提高计算速度)，每次只更新出现的词的向量(特定行)
- 为什么需要两个向量？——数学上更简单(中心词和上下文词分开考虑),最终是把2个向量平均。也可以每个词只用一个向量。
- word2vec的两个模型：Skip-grams(SG), Continuous Bag of Words(CBOW), 还有negative sampling技巧，抽样分布技巧(unigram分布的3/4次方)
- 为什么不直接用共现计数矩阵？随着词语的变多会变得很大；维度很高，需要大量空间存储；后续的分类问题会遇到稀疏问题。解决方法：降维，只存储一些重要信息，固定维度。即做SVD。很少起作用，但在某些领域内被用的比较多，举例：Hacks to X(several used in Rohde et al. 2005)
- Count based vs. direct prediction
- Glove-结合两个流派的想法，在神经网络中使用计数矩阵，共现概率的比值可以编码成meaning component
- 评估词向量的方法（内在—同义词、类比等，外在—在真实任务中测试，eg命名实体识别）
- 词语多义性问题-1.聚类该词的所有上下文，得到不同的簇，将该词分解为不同的场景下的词。2.直接加权平均各个场景下的向量，奇迹般地有很好的效果
cs224n-2019-notes02-wordvecs2
- Glove
- 评估词向量效果的方法

Suggested Readings

Additional Readings:

参考阅读

理解GloVe模型（+总结）(很详细易懂，讲解了GloVe模型的思想)

Python review[slides]

review

glove的思想、算法步骤分解、代码

评估词向量的方法

Lecture 03: Word Window Classification, Neural Networks, and Matrix Calculus

Course information update (5 mins)
Classification review/introduction (10 mins)
Neural networks introduction (15 mins)
Named Entity Recognition (5 mins)
Binary true vs. corrupted word window classification (15 mins)
Matrix calculus introduction (20 mins)

课件

cs224n-2019-lecture03-neuralnets
- 分类：情感分类，命名实体识别，买卖决策等，softmax分类器，cross-entropy损失函数(线性分类器)
- 神经网络分类器，词向量分类的不同(同时学习权重矩阵和词向量，因此参数也更多)，神经网络简介
- 命名实体识别(NER)：找到文本中的"名字"并且进行分类
- 在上下文语境中给单词分类，怎么用上下文？将词及其上下文词的向量连接起来
- 比如如果这个词在上下文中是表示位置，给高分，否则给低分
- 梯度
matrix calculus notes
cs224n-2019-notes03-neuralnets
- 神经网络，最大边缘目标函数，反向传播
- 技巧：梯度检验，正则，Dropout，激活函数，数据预处理(减去均值，标准化，白化Whitening)，参数初始化，学习策略，优化策略(momentum, adaptive)

Suggested Readings:

Additional Readings:

Natural Language Processing (Almost) from Scratch

Assignment 2

[code] [handout]

review

NER

梯度

Lecture 04: Backpropagation and Computation Graphs

Matrix gradients for our simple neural net and some tips [15 mins]
Computation graphs and backpropagation [40 mins]
Stuff you should know [15 mins]
a. Regularization to prevent overfitting
b. Vectorization
c. Nonlinearities
d. Initialization
e. Optimizers
f. Learning rates

课件

cs224n-2019-lecture04-backprop
- 梯度计算分解，一些tips，使用预训练的词向量的问题
- 计算图表示前向传播和反向传播，用上游的梯度和链式法则来得到下游的梯度
- 正则，矢量化，非线性，初始化，优化器，学习率
cs224n-2019-notes03-neuralnets

Suggested Readings:

Lecture 05: Linguistic Structure: Dependency Parsing

Syntactic Structure: Consistency and Dependency (25 mins)
Dependency Grammar and Treebanks (15 mins)
Transition-based dependency parsing (15 mins)
Neural dependency parsing (15 mins)

cs224n-2019-lecture05-dep-parsing [scrawled-on slides]

短语结构，依赖结构

cs224n-2019-notes04-dependencyparsing

Suggested Readings:

Assignment 3

[code] [handout]

Lecture 06: The probability of a sentence? Recurrent Neural Networks and Language Models

Recurrent Neural Networks (RNNs) and why they’re great for Language Modeling (LM).

cs224n-2019-lecture06-rnnlm

语言模型
RNN

cs224n-2019-notes05-LM_RNN

Suggested Readings:

N-gram Language Models (textbook chapter)
The Unreasonable Effectiveness of Recurrent Neural Networks (blog post overview)
Sequence Modeling: Recurrent and Recursive Neural Nets (Sections 10.1 and 10.2)
On Chomsky and the Two Cultures of Statistical Learning

Lecture 07: Vanishing Gradients and Fancy RNNs

Problems with RNNs and how to fix them
More complex RNN variants

cs224n-2019-lecture07-fancy-rnn

梯度消失
LSTM和GRU

cs224n-2019-notes05-LM_RNN

Suggested Readings:

Sequence Modeling: Recurrent and Recursive Neural Nets (Sections 10.3, 10.5, 10.7-10.12)
Learning long-term dependencies with gradient descent is difficult (one of the original vanishing gradient papers)
On the difficulty of training Recurrent Neural Networks (proof of vanishing gradient problem)
Vanishing Gradients Jupyter Notebook (demo for feedforward networks)
Understanding LSTM Networks (blog post overview)

Assignment 4

[code] [handout] [Azure Guide] [Practical Guide to VMs]

Lecture 08: Machine Translation, Seq2Seq and Attention

How we can do Neural Machine Translation (NMT) using an RNN based architecture called sequence to sequence with attention

cs224n-2019-lecture08-nmt

机器翻译：
- 1.1950s，早期是基于规则的，利用词典翻译；
- 2.1990s-2010s，基于统计的机器翻译(SMT)，从数据中学习统计模型，贝叶斯规则，考虑翻译和句子语法流畅。对齐：一对多，多对一，多对多。
- 3.2014-，基于神经网络的机器翻译(NMT)，seq2seq，两个RNNs。seq2seq任务有：总结(长文本到短文本)，对话，解析，代码生成(自然语言到代码)。贪心解码。束搜索解码
- 评估方式：BLEU(Bilingual Evaluation Understudy)
- 未解决的问题：词汇表之外的词，领域不匹配，保持较长文本的上下文，低资源语料少，没有加入常识，从训练数据中学到了偏见，无法解释的翻译，
- Attention。

cs224n-2019-notes06-NMT_seq2seq_attention

Suggested Readings:

Statistical Machine Translation slides, CS224n 2015 (lectures 2/3/4)
Statistical Machine Translation (book by Philipp Koehn)
BLEU (original paper)
Sequence to Sequence Learning with Neural Networks (original seq2seq NMT paper)
Sequence Transduction with Recurrent Neural Networks (early seq2seq speech recognition paper)
Neural Machine Translation by Jointly Learning to Align and Translate (original seq2seq+attention paper)
Attention and Augmented Recurrent Neural Networks (blog post overview)
Massive Exploration of Neural Machine Translation Architectures (practical advice for hyperparameter choices)

Lecture 09: Practical Tips for Final Projects

Final project types and details; assessment revisited
Finding research topics; a couple of examples
Finding data
Review of gated neural sequence models
A couple of MT topics
Doing your research
Presenting your results and evaluation

cs224n-2019-lecture09-final-projects

默认的项目是问答系统SQuAD
Look at ACL anthology for NLP papers: https://aclanthology.info
https://paperswithcode.com/sota
数据：
- https://catalog.ldc.upenn.edu/
- http://statmt.org
- https://universaldependencies.org
- Look at Kaggle，research papers，lists of datasets
- https://machinelearningmastery.com/datasets-natural-languageprocessing/
- https://github.com/niderhoff/nlp-datasets

final-project-practical-tips

Suggested Readings:

Practical Methodology (Deep Learning book chapter)

Lecture 10: Question Answering and the Default Final Project

Final final project notes, etc.
Motivation/History
The SQuAD dataset
The Stanford Attentive Reader model
BiDAF
Recent, more advanced architectures
ELMo and BERT preview

cs224n-2019-lecture10-QA

两个部分：寻找那些可能包含答案的文档(信息检索)，从文档或段落中找答案(阅读理解)
阅读理解的历史，2013年MCTest：P+Q——>A，2015/16：CNN/DM、SQuAD数据集
开放领域问答的历史：1964年是依赖解析和匹配，1993年线上百科全书，1999年设立TREC问答，2011年IBM的DeepQA系统，2016年用神经网络和信息检索IR
SQuAD数据集，评估方法
斯坦福的简单模型：Attentive Reader model，预测回答文本的起始位置和结束位置
BiDAF
cs224n-2019-notes07-QA