Summer DL


Statistical Theory of Deep Learning and Its Applications

Ye Luo (University of Hong Kong)

Day 1

Why do we study neural networks?

  • Neural networks (intuitively) have much better properties than other statistical/machine learning procedures, which means they also have greater potential.

  • Many theoretical properties are still unknown. Guidance is needed for practical purposes.

  • In practice, neural networks outperform almost every other statistical/machine learning procedure when a large amount of data is available.

What do people do with neural networks in practice?

  • Image recognition/ Speech recognition.

  • Predict outcomes of individual behaviors based on big data.

  • Develop just-in-time system to predict gaps in supply and demand and therefore automatically generate pricing policies.

  • Build complicated models to interpret patterns of data that humans do not understand.

  • There exists huge potential in using neural networks in analyzing complicated data.

Performance comparison

The convergence rate of linear regression is $1/\sqrt{n}$, where n is the number of data points.

What is a neural network?

  • A neural network consists of one input layer, with independent variables z serving as inputs, so the number of input neurons equals the dimensionality of z;

  • one output layer, with the dependent variable y serving as the target of the outputs (y is not the output, but you want the output to be as close to y as possible under a certain norm/loss function). The number of output neurons equals the dimensionality of y;

  • one or more hidden layers of neurons that link inputs and outputs.

  • One hidden layer: shallow neural network. More than one: deep neural network. Deep Learning: using deep neural network as estimators.

  • Values propagate from inputs to outputs layer after layer, via activation functions (will explain shortly).

What do neural networks do?

  • Each neuron has a few parameters of its own (w and b). So there are many parameters in total; denote them as a vector $\beta$.

  • Given all the values of the links, you can calculate the output vector $\hat y$ given x through feedforward, i.e., you can forward-propagate the values of the neurons layer by layer. Therefore, the output of the neural network is a function of x, depending on the parameter $\beta$. Denote it as $\hat y=\psi (x|\beta)$.

  • Given a loss function L(·, ·), you solve $\min_{\beta}\sum_{i=1}^n L(\hat y_i, y_i) = \min_{\beta}\sum_{i=1}^n L(\psi(x_i|\beta), y_i)$. L can be any function of interest (a small numerical sketch follows this list), including:

    • L2 loss:$L(\hat y, y) = ||\hat y − y||_2^2$

    • L1 loss:$L(\hat y, y) = ||\hat y − y||_1$.

    • Entropy (for y being one-dimensional): $L(\hat y, y) = ylog(\frac{\hat y}{y})$

Neural Network in 1990s

  • The neural networks in use were mostly shallow; deep neural networks were too difficult to optimize.

  • It was also known as "projection pursuit". Neural networks, as "projection pursuit", have nothing to do with human brain structures. Model: $f(x) = \sum_{i=1}^k\phi_i(x\beta_i)$, where $\beta_i$ is a projection parameter and $\phi_i$ is a non-linear functional link to be learned.

  • Later, researchers discovered that it is OK to set $\phi_i=c_i\phi$, where $\phi$ is a common (increasing) non-linear link function.

  • Training costs were high compared to other methods such as regression, SVM and others: non-linear, non-convex optimization.

  • Bayesian networks and Markov Chain Monte Carlo methods were a popular way to work with neural networks. Also computationally costly.

Some other popular Machine Learning methods

  • Supervised learning: there is a target variable y to be fitted against some predictors x.

  • Regression-based machine learning methods: Regression, LASSO, Ridge, Boosting.

  • Classification-based machine learning methods: Random Forest, AdaBoost, Support Vector Machines (SVM), etc.

  • Unsupervised learning: there is no target variable y to be fitted. What is the objective function to optimize?

  • K-Means, mixture models and EM algorithms, clustering algorithms, discrete smoothing.

  • Many other modern machine learning techniques perform several of the above tasks together: multi-algorithm AI systems that may include one or more ML algorithms such as DL and others.

LASSO

$\hat\beta = \arg\min_\beta Q(\beta) +\lambda \sum_{i=1}^p |\beta_i|$

  • $\lambda$ is a tuning parameter.

  • Q is the objective function.

  • Good when Q is convex - convex optimization.

  • Variable selection and good performance in high-dimensional models (dim($\beta$) > n, where n is the sample size). A minimal usage sketch follows this list.

SVM

Non-linearity

  • In LASSO or SVM, the way to solve non-linearity is to introduce non-linear transformations of x - expanding the base functions/base learners.

  • In LASSO, we can use z = polynomials of x to replace x.

  • In SVM, we can use kernel functions K(x, x') to replace x.

  • SVM was highly successful and popular before the rise of Deep Learning (a small sketch of both base-expansion approaches follows this list).

Key Features of these machine learning methods other than neural networks

  • Need to specify x. For non-linear models, use non-linear transformations of x as functional bases, e.g., polynomials, splines, Fourier series, wavelets, kernels, etc.

  • Pre-specify the functional bases before learning. Additive structure.

  • Coefficients/weights on predictors are trained after choosing the base functions.

  • Works well when the base functions are properly chosen; not so well otherwise. Representation is key!

Why neural networks are different?

  • The base functions are not pre-specified.

  • Train the base functions together with coefficients.

  • Neural networks are a type of "flexible base" learner, versus "fixed base" learners.

Why flexible base is good?

  • Let us go back to a more fundamental question: What does machine learning do?

  • We know that Machine Learning recognizes (unknown) patterns and rules hidden in the data.

  • But what are patterns? What are rules?

How could a machine find out Kepler’s rules?

  • It is well known that $T^2/R^3$ = constant. Kepler found these rules by scanning a large amount of data.

  • How to find out (or approximate) this rule in data?

  • In the 1970s, AI scientists used a decision tree to find it. The operators allowed are × and ÷. Leaves are either R or T.

  • A combinatorial problem, extremely difficult to solve.

  • What if you are not using × and ÷ as operators? What if you use $T^2$ and $R^3$ to replace T and R?

Another Example: Einstein’s Theory of Relativity

Problems for fixed base estimators

  • The performance highly relies on the choice of functional base and operations.

  • Requires a lot of modelling work. However, human intuitions about complex problems are often wrong, and at best approximations.

  • The underlying assumption of these machine learning algorithms is: "there exists a simple expression of the rules under this set of base functions."

  • Such an assumption is not always true.

Patterns are simple expression of the base functions

  • Imagine a chess board with 64 grids. Each grid is either white or black. Assume that only 8 grids are black, with the rest being white.

  • If all the black grids lie on the diagonal, do you think this is a pattern? Use i = j to characterize the pattern.

  • If the 8 black grids are selected randomly, do you consider it a pattern? You need 8 equations instead of 1 equation.

  • So, many machine learning algorithms learn simple expressions (which we call patterns) under the "base functions" that are pre-specified.

Modern Data Calls for Deep Learning

  • Modern data: complex, unstructured/structured, high-dimensional. Non-linear representations of patterns are everywhere, e.g., computer vision, natural language processing, and many others.

  • Chentong Qiu: Modern data is high dimensional data that is embedded “around” low dimensional manifold.

  • (Non-Linear) Dimensionality Reduction; Geometric Transformation of Masses.

Dimensionality Reduction by controlling the size of DNN

An autoencoder can be used for dimensionality reduction: an encoder first reduces the dimension, and a decoder then restores the original dimension. For a drastic reduction, e.g., from 8 dimensions to 1, it is better to first reduce to an intermediate dimension and then to the target dimension; this works better than reducing directly to the target dimension.

Mainly used in natural language processing.

Geometric of Deep Learning

Figure: Picture from Gu et. al. (2018)

A spiral curve: first reduce to 1 dimension, then recover 2 dimensions.

2 layers (x → y): logistic regression / linear algebra.

3 layers (x → hidden → y): Fourier transform.

Deep Learning is fundamentally different compared to traditional methods

  • Deep Learning does not pre-specify base functions. Rather, it learns base functions that adapt to the data, so in general it performs better than fixed-base machine learning methods.

  • In some specific problems, fixed base methods can have better performances.

  • Deep Learning has much stronger expression power compared to other traditional methods.

  • Deep Learning gains more and more benefit as the problem to be solved becomes larger, i.e., as the dimensionality of the variable x grows.

Universal Approximation Property (uniform approximation theory)

  • It is also known that neural network with large depth can have similar properties.

  • Rojas (2003) shows that recurrent neural network with Perceptron as activation function and width = 1 can approximate any two-way classifiers.

  • Sutskever and Hinton (2008) shows that Deep RBM can approximate any binary distributions.

Universal Approximation property in depth

  • Luo and Liang (2017) prove that a neural network with fixed width and large depth can approximate any continuous function or multi-way classifier for general activation functions.

  • We started by showing the results for rectified linear activation functions.

  • Then, the results can be generalized to arbitrary activation functions such as the sigmoid. Our theoretical results show that many local optima can have similar performance to the global optimum.

  • Extension to High Dimensional Variable Selection problems.

Andrew Barron's early work on Shallow Neural Networks

  • Andrew Barron (1990) and Andrew Barron (1994) establish approximation bounds for shallow neural networks.

  • The proof of the theory has an interesting link with Fourier analysis. In fact, a shallow neural network can be viewed as approximating a function on the Fourier space.

Risk bounds on neural network

$O_p$ denotes the same order; the bound can be read as: left-hand side $\leq C(\frac{1}{K_n}+\frac{pK_n}{n}\log n)$, where the first term is the approximation error and the second term the overfitting error; n is the sample size and p is dim(x).

$K_n$ can be neither too large nor too small. To minimize the upper bound above, the two terms should be equal, giving $K_n^2=\frac{n}{p\log n}$; the approximation term is then $\frac{1}{K_n}=(\frac{p\log n}{n})^{1/2}$, which grows only polynomially in p (the dimension of x).

For fixed-base methods, in contrast, the approximation error grows exponentially in p (see the figure below).

The curse of dimensionality.

Universal Approximation property in depth in nonparametric regression

Why is overfitting important?

  • When the number of parameters is numerous, it is quite possible to overfit the data.

  • The parameter to data ratio p/n appears in many different risk bounds of linear models, e.g., regression.

  • Neural networks are born with many parameters. Great approximation ability comes along with great danger of overfitting.

  • In principle, simpler models tend to have less risk of overfitting.

Complexity Measures in Statistics

  • In statistics, there are a few ways to measure the complexity of models.

  • The first is the VC dimension, which is popular in statistics.

  • The second is statistical entropy. We will focus on entropy today. Entropy can be viewed as a distribution-specific way of characterizing VC dimensions.

  • Reference: Van der Vaart and Wellner (1996). Empirical process.

$B_{\epsilon}(f)=\{g : \|f-g\|^2 \leq \epsilon\}$, where $\|f-g\|^2=\int \|f(x)-g(x)\|^2 d\mu(x)$ and $\mu(x)$ is the density of x.

$\hat \beta=argmin_\beta \sum(y_i - g(x_i, \beta))^2$

$\beta_0=argmin_\beta E[(y - g(x, \beta))^2]$

$\hat\beta \to \beta_0 \text{ if } \frac1n \sum_i(y_i - g(x_i, \beta))^2 \to E[(y - g(x, \beta))^2] \text{ for all } \beta \in R^{\dim(\beta)}$

That is, convergence is required everywhere (for all β).

The cause of overfitting: the convergence does not hold everywhere.

Central limit theorem:

$\frac{1}{\sqrt{n}}\sum_i\big(x_i - E[x_i]\big) \to N(0, \sigma^2)$

For the unit ball, $N(\epsilon, \mathcal{F}, \|\cdot\|) \leq \frac{1}{\epsilon^{2\dim(\beta)}}$, where the denominator is the volume of each small ball.

Examples

How much overfitting?

Approximation

References

  • A. Barron, 1990, Universal Approximation Bounds for Superpositions of a Sigmoidal Function.

  • A. Barron, 1994, Approximation and Estimation Bound for Artificial Neural Networks.

  • Hornik, Stinchcombe and White, 1989, Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks.

  • R. Rojas, 2003, Networks of Width One Are Universal Classifiers.

  • Sutskever and Hinton 2008, Deep, narrow sigmoid belief networks are universal approximators.

  • Luo and Liang(2017), Universal Approximation of Deep Neural Networks and Estimation Bounds.

Day 2

Backpropagation

  • Rumelhart, Hinton and Williams (1986), Learning representations by back-propagating errors.

  • Loss function $L = (t − y)^2$, where t is the value predicted by the neural network.

  • Neural network with L layers.

Standard backpropagation algorithm as gradient descent

  • Learning rate $\lambda_k$ is often chosen as a small constant.

  • Drawbacks: no theoretical guarantee of convergence; no good local performance.

  • Instead, use step size shrinking to 0. E.g., $\lambda_k=1/k$ or $\lambda_k=1/\sqrt{k}$ .

  • Does not work when data set is too large.

  • Large computational burden when the data set is too large; switch to stochastic gradient descent.

  • Good for fine-tuning. No guarantee of good global performance.

When the data set is large, gradient descent needs to compute the gradient at every data point and then average, which is too expensive; hence full gradient descent is not suitable.

Stochastic gradient descent

  • Instead of using the full data to evaluate L, we use a batch of (randomly drawn) data $D_k$ to evaluate the loss function at this small batch of data.

  • Stochastic gradient descent requires learning rate to satisfy $\sum_{k=1}^{\infty}\lambda_k=\infty, \sum_{k=1}^{\infty}\lambda_k ^2<\infty$

  • On the one hand, the total step size must be long enough so that the iterates can travel far enough. On the other hand, the stochastic error must be contained.

  • For example, one can pick $\lambda_k : = k^{\alpha}, \text{ with } \alpha \in (0.5,1)$

  • In statistics, this trick is referred to as "stochastic approximation".

GD: $\beta^{k+1}=\beta^k -\lambda_k\frac{\partial L}{\partial \beta^k}$

SGD: $\beta^{k+1}=\beta^k -\lambda_k(\frac{\partial L}{\partial \beta^k}+\eta_k)$, where $\eta_k$ is noise, following a standard normal distribution.

GD: $\beta^{\infty}=\beta^0 -\sum_{k=1}^\infty\lambda_k\frac{\partial L}{\partial \beta^k}$

SGD: $\beta^{\infty}=\beta^0-\sum_{k=1}^\infty\lambda_k\frac{\partial L}{\partial \beta^k}-\sum_{k=1}^\infty\lambda_k\eta_k$

For SGD to converge to the optimum, the last term must converge to 0, so its variance must be summable; therefore

$\sum_{k=1}^\infty\lambda_k^2\gamma_k^2 \leq \sum_{k=1}^\infty\lambda_k^2\lt\infty$

Challenges with the Gradient Descent algorithm

  • Vanishing/exploding gradient problem.

  • Good initialization.

  • Picking “optimal” learning rate.

Problems with gradient

A gradient close to 0 has two possible causes: either we are near an optimum, or the gradient values are numerically small and their products are even smaller.

The sigmoid function is prone to this problem: its gradient is small at both tails.

  • The derivative of the sigmoid function $\phi(x) = 1/(1+e^{-x})$ is $\phi(x)(1 − \phi(x))$ (a tiny numerical check follows this list).

  • The derivative is bounded by $e^{−|x|}$. So when x is large, the gradient vanishes in an exponential speed.

  • A vanishing gradient makes the optimization get stuck at a local optimum.

Exploding gradient

When a layer is very wide (many neurons), the gradient at the previous layer sums contributions from all neurons of the next layer, so sigmoid networks can also suffer from exploding gradients.

  • Let us look at a deep neural network with a single neuron in each layer.

  • By the chain rule, the gradient is a product of many terms.

  • The gradient can be super large! Need adaptive learning rate.

ReLU is much less prone to both problems.

Solutions to vanishing and exploding gradients

  • Normalization of each variable xi .

  • Normalizing gradient/ adaptive learning rate.

  • Add regularization.

  • Use Rectified Linear instead of Sigmoid.

  • Many other ways; see Pascanu, Mikolov and Bengio (2013).

Initialization and pretraining

  • Initialization is extremely important for training deep neural networks.

  • Traditional way of initialization: set all weights as small random variables drawn from $[−\alpha, \alpha]$. Rule of thumb: $\alpha$ = 1.

  • Heuristic: the neural network is locally linear. Fit the "best" linear function first, then fine-tune.

pretraining with AutoEncoders

  • Pretraining is to get a good recipe for the deep neural network to start with.

  • How to do pretraining: Stacked AutoEncoders.

  • What is an autoencoder: train X to X.

  • Deep Belief neural networks.

An illustrative Figure

Why Does Unsupervised Pre-training Help Deep Learning

Let's get back to one of the original motivations for going deep with neural networks: to capture transformations of the input features, then further transformations of the first layer's output, and so on. Learning a representation was not part of the learning method described above. One trick to learn a representation is to reproduce the input. Suppose we want to learn a depth-L neural network. We start by training the first layer so that the input can be reproduced as output. This is shown in the first row of Figure 4. By trying to reproduce the input, the hope is that the neurons will learn a representation of it. This is called an auto-encoder. (If noise is added to the input, but not to the output that is meant to be learned, then it is a denoising auto-encoder.) Next, the weights of the first layer are kept fixed and the second layer's weights are learned so as to reproduce the output of the first layer, hopefully capturing a somewhat more complex representation than the first layer. This method of greedily training each layer one at a time is called pre-training with stacked auto-encoders. Finally, all the weights are initialised to the learnt weights and the backpropagation algorithm is run on supervised data to get the final weights; this step is called finetuning. This procedure has been shown to outperform learning a DNN all at once from randomly initialised weights. Two prominent hypotheses were proposed in the paper to justify why training this way works better than no pre-training.

The Optimisation Hypothesis: The first hypothesis is that since this is a non-convex optimisation problem the pre-training helps in providing a better initialisation point leading to better optimisation of the empirical risk.

The Regularisation Hypothesis: The second hypothesis the paper checks is whether such a method acts as a better regulariser and hence leads to a better generalization error.

AutoEncoders

  • AutoEncoders as Nonlinear PCA.

  • It compresses variable x into a lower dimensional representation.

  • The representation is non-linear. Solution may be non-unique.

  • Luo and Liang (2016) shows consistency of Deep AutoEncoders.

Ideas behind pretraining

  • Idea 1: Denoising. Pretraining obtains key features/representation of x.

  • Idea 2: Better Initialization.

  • Idea 3: Better Regularization: avoid overfitting.

The more layers there are, the greater the advantage of pretraining.

problems with learning rate

  • Picking the right learning rate is extremely important in training a neural network.

  • Convergence of the algorithm depends on the setup of learning rate.

  • Learning rate too small: the algorithm is easily trapped at local minima.

  • Learning rate too large: the algorithm does not converge well.

Several different ways to adjust learning rate

Accelerated Gradient Descent

Accelerated Gradient Descent: study of Hinton et al. (2013)

Historical Remark: Boosting algorithms

  • AdaBoost for classification by Freund and Schapire.

    • Ensemble method: combining many "weak" classifiers to produce a powerful "committee" (majority voting)

    • “best off-the-shelf classifier in the world”(Breiman, 1998)

    • Resistance to overfitting

  • AdaBoost algorithm as gradient descent algorithm in function space(Breiman, 1998)

  • Embedding of boosting algorithms into the framework of statistical estimation and additive basis expansion (Friedman, 2001, and Friedman et al. 2000)

  • Representation of boosting as “stagewise, additive modelling”

  • “matching pursuit”, “greedy algorithm”

L2Boosting: Prediction

Boosting

  • Ensemble method: combining many "weak" classifiers to produce a powerful "committee"

  • Gradient descent algorithm in function space

  • “best off-the-shelf classifier in the world”(Breiman, 1998)

  • Unusual penalization scheme via early stopping

  • Very flexible by combination of criterion functions and learners.

  • Stepwise Algorithm.

L2Boosting (BA)

Post-L2Boosting algorithm (post-BA)

For example, with five variables $x_1, x_2, ..., x_5$, suppose L2Boosting runs three steps and stops, having selected $x_1, x_3, x_5$; then $\hat T=\{1, 3, 5\}$ is the index set of the non-zero elements of $\beta$. Post-L2Boosting refits least squares on the selected variables (a small sketch of componentwise L2Boosting follows below):

$\tilde\beta = \arg\min_\beta\sum_{i=1}^n (y_i-x_{i1}\beta_1-x_{i3}\beta_3-x_{i5}\beta_5)^2$

Greedy algorithms

  • Temlyakov (and coauthors) made seminal contributions to the theory of greedy algorithms / approximation theory.

  • Rate of convergence for the pure greedy algorithm (PGA): $m^{−1/6}$, where m denotes the number of steps (later improved to $m^{−11/62}$, with a lower bound of $m^{−0.27}$).

  • Idea: Improve rates by additional assumptions on the design matrix and analysis of the revisiting behavior of boosting

  • Intuition: slow rate of convergence if continuously new variables are selected / visited.

Boosting: $y=x\beta+\epsilon$ will overfit.

PGA: $y=x\beta$ will not overfit.

Approximation Theory of PGA based on revisiting (Luo and Spindler (2017))

$||y-x\beta||^2=O(m^{-11/62})$

$||y-x\beta||^2=O((\frac s{m+s})^{\epsilon^*(c)-\delta}),s=|supp(\beta)|$

Main result for L2Boosting

Boosted Backpropagation

  • Grubb and Bagwell (2010): train the neural network in a greedy way. Schwenk and Bengio (2010): apply AdaBoost to neural networks.

  • Idea: after calculating the gradient g, pick a direction H from the set $\mathcal{H}$ that maximizes $\langle H, g \rangle$.

  • $\mathcal{H}$ can be set as the elementary vectors, i.e., only one component is 1, with the rest being 0. This leads to a greedy algorithm.

  • Substantial improvement of performance.

Boosted Backpropagation: study from Schwenk and Bengio

Resnet: mixing Boosting with DNN

Predictive Resnet: training DNN with Boosting

Optimize block by block.

Classification Resnet: training DNN with Boosting

Huang et.al. (2017).

Overview of non-local optimization methods

  • Markov Chain Monte Carlo: turns the optimization problem into an integration problem via repeated sampling; performs well in low dimensions, poorly in high dimensions.

  • Simulated Annealing: has a cooling-down (schedule) problem.

  • EM algorithm: can handle missing variables.

  • Genetic Algorithms.

EM-algorithm

  • Missing variables $x=(x^{obs}, x^{mis})$; predicting y.

  • Likelihood: $f(y|x^{obs},\theta)=\int_{x^{mis}}f(y|x^{obs}, x^{mis}, \theta)f(x^{mis}|x^{obs}, \theta)dx^{mis}$

  • Difficult for optimization

Example: quantile regression.

$y=x'\beta(\tau)$, $\tau\sim U[0,1]$, where $\tau$ is unknown and $\tau|x \sim U[0,1]$.

$y=x'\beta(\tau)+\epsilon$, $\epsilon \sim f_\epsilon(\cdot)$

$\Pr(y|x)=\int_0^1 f_\epsilon(y-x\beta(\tau))d\tau$ is hard to optimize, since $\log\int d\tau \ne \int \log d\tau$.

Luo and Liang 2018 ICC(JRSSB)

Initialize $\theta^{(0)}$.

Step 1: Draw a copy of $x^{mis}$, where $x^{mis}\sim f(x^{mis}|x^{obs}, \theta)$ (MCMC can be used).

Step 2: Maximize the log-likelihood assuming $x^{mis}$ is given. The likelihood is now $f(y|x^{obs}, x^{mis}, \theta)$. Obtain the new $\theta$:

$\theta^{(1)}=\arg\max_\theta \sum_{i=1}^n \log f(y|x^{obs}, x^{mis}, \theta)$, which is easy,

whereas the original $\arg\max_\theta \sum_{i=1}^n \log\int f(y|x^{obs}, x^{mis}, \theta)f(x^{mis}|x^{obs}, \theta)dx^{mis}$ is hard.

Step 3: Set k = k+1 and return to Step 1.

Guaranteed path convergence. Very fast compared to other existing algorithms.

Luo and Liang 2018

  • We mix MCMC with EM algorithms to train Deep Belief Networks.

  • Consistency and convergence of the algorithm is proved under weak conditions.

  • Hinton (2006): contrastive divergence (CD) algorithm. The gradient is approximated by a short run of MCMC, followed by gradient descent.

  • CD does not necessarily converge to the maximum likelihood estimate (MLE) of the parameters, as noted by Carreira-Perpiñán and Hinton (2005) and Bengio and Delalleau (2009).

A Gradient-Free Algorithm to train Deep Belief Networks

  • Idea: Treating the values of neurons in the Deep Belief Networks as missing variables. We name this algorithm as “GICC”.

  • Apply EM algorithm layerwise.

  • The Bayesian probability can be separated because of the dependency structure.

  • Imputing/sampling the values by MCMC.

  • The only thing to solve is many logistic regressions.

  • Benefits: 100000 times faster than CD algorithm in convergence.

  • Good global convergence property.

  • Logistic regressions are easy to solve and converge well in global sense.

Restricted Boltzmann Machine

Deep Belief Networks propagate distributions rather than values (ordinary neural networks propagate values); they come from Restricted Boltzmann Machines.

The hidden states are not observable.

Let $w_{ij}$ be the weight of the connection $h_i \leftrightarrow v_j$.

$Pr(h_i=1|v)=\sigma(\sum_{j=1}^n w_{ij}v_j + a_i)$, where $\sigma(x)=1/(1+e^{-x})$ and $h_i$ only takes the values 0 and 1.

$Pr(v_j|h)=\sigma(\sum_{i=1}^m w_{ij}h_i + b_j)$

$P(V)=\prod_{i=1}^nP(v_i)$, $P(v_i)=\sum_h P(v_i|h)p(h)$; $\log P(v_i)$ is hard to handle because the log contains a sum.

Using the ICC algorithm (a rough sketch in code follows these steps):

(1) Draw $h_i \sim \sigma(\sum_{j=1}^n w_{ij}v_j + a_i)$.

(2) $w=\arg\max\sum_{j=1}^n \log P(v_j|h)$: a logistic regression.

(3) Go back to (1).

Experiments Results

Conclusion

  • SGD is still the most popular way of optimizing Neural Networks.

  • Requires a lot of computing power and good tricks to make it work.

  • Initialization, learning rate, etc.

  • Non-gradient-based methods may take over in the future.

References

  • Rumelhart, Hinton and Williams, 1986, Learning representations by back-propagating errors.

  • Luo and Liang, 2016, Consistency of Stacked Deep AutoEncoders.

  • Pascanu, Mikolov and Bengio, 2013, on the difficulty of training recurrent neural networks.

  • Sutskever, Martens, Dahl and Hinton, 2013, On the importance of initialization and momentum in deep learning.

  • Grubb and Bagwell, 2010, Boosted Backpropagation Learning for Training Deep Modular Networks.

  • Schwenk and Bengio, 2010, Boosting Neural Networks.

  • Hinton, Osindero and Teh, 2006, A Fast Learning Algorithm for Deep Belief Nets.

  • Liang and Luo, 2017, A likelihood gradient free algorithm for fast training of Restricted Boltzmann Machines.

Day 3

What is overfitting?

Complexity principle vs. simplicity principle: people always prefer simpler models, but that does not mean simple models are good.

Bet on sparsity: with little data, a complex model cannot be discovered, so we simply assume the model is simple.

Bet on similarity: assume individuals are similar, so a few models can describe all individuals.

The closer points A and B are, the better.

Key Observations

  • In complicated problems, we need models that are flexible enough to approximate the “true model”.

  • If the complexity of the model is too large, or if too many models are fitted at the same time, overfitting occurs.

  • Number of parameters p and the sample size n are enemies. For larger n, with the same confidence level, a bigger model can be allowed. For larger p, it requires more data, i.e., larger n to obtain good “confidence” on the trained model.

  • Again, the ability of representation is the key! If a model with small size can capture the main pattern of data, then this model is likely to have better generalizability compared to a bigger model.

  • Modelling is important in traditional machine learning: it requires good intuition (or luck?) to choose models and base functions. Traditional methods choose the model manually; modern methods use metamodeling.

  • For flexible learners like neural networks, such a modelling problem does not exist.

  • Complexity of a model comes from two sources: Number of Variables, and structure of the model.

  • Neural networks are complicated in structure, but save on the number of variables needed to approximate any function.

  • The complexity of Neural Networks is hidden in the complicated connections of neurons.

MetaModeling

A collection of base functions.

Example:

(1) $y=x\beta+\epsilon$: regression.

(2) $|t(x_j)|<C$: not significant.

(3) Drop $x_j$, go back to (1).

This is t-statistic picking; it violates the assumption $E[\epsilon|x]=0$, so the resulting $\beta$ is biased.

One Key question in Machine Learning

The following questions are either equivalent or closely connected.

  • How to increase generalizability of the model (or out of sample performance in statistics)?

  • Which machine learning method captures the pattern of the data in the most "efficient" (simplest) way? (How to select the "best" model?)

  • How do we know that our model really captures the pattern of data?

  • How to reduce overfitting of the model but keep or increase the effectiveness of the model at the same time?

Review of overfitting for Several Simple Models: Linear Regression

When $n \to\infty$, we need $\lambda \to 0$.

When $p \to\infty$, we need $\lambda \to\infty$.

Rule of thumb: $\lambda=\sigma \sqrt{\frac{\log p}{n}}$, where $\sigma^2=var(\epsilon)$.

Review of overfitting for Several Simple Models: Decision Tree

  • A tree structure with each node as $1(x_j ≤ t)$ or $1(x_j > t)$ for some $x_j \in {x_1, ..., x_k}$ and some $t \in R$.

  • The base function is $1(x_j ≤ t)$. Can approximate any decision rule based on $x_1, ..., x_k$.

  • Pre-pruning that stop growing the tree earlier, before it perfectly classifies the training set.(Early stopping)

  • Post-pruning that allows the tree to perfectly classify the training set, and then post prune the tree.

  • Bootstrap and model aggregation: Random Forest.

Resampling: each bootstrap sample yields one tree, and the trees then vote.

Decision trees vs. neural networks:

A decision tree's decision boundaries can only be axis-aligned (horizontal or vertical in 2D), e.g., x ≤ 0; fitting a boundary that is not of this form requires many trees, whereas a neural network can handle such boundaries directly.

Overview of Methods that help to prevent overfitting

  • Model Selection Methods: AIC, BIC, LASSO, etc..

  • Penalization: LASSO, Ridge, etc..

  • Early Stopping: Try to stop the learner before it overfits.

  • Model Averaging.

  • Cross-Validation.

  • PCA (Non-linear PCA - AutoEncoders).

Overfitting Issues with Deep Neural Networks

  • Neural Networks typically have many parameters with complicated structures.

  • The supreme power of neural networks in representing unknown functions comes along with a great danger of overfitting: outliers get represented as well.

  • Taming overfitting is critical in applying Deep Learning.

Penalization

  • The first way to reduce overfitting is to penalize the coefficients in Deep Neural Networks.

  • In practice, most of the coefficients you train in a deep neural network are quite close to 0.

  • In supervised learning, minimize $L(\psi(x|\beta) − y) + \lambda \|\beta\|$ (a minimal training-loop sketch follows this list).

  • The Norm can be picked as $|| · ||_1, || · ||_2$, and many others.

  • Backpropagation still works when $\|\cdot\|$ is almost everywhere differentiable.

Drawbacks of Penalization

  • When the SGD algorithm does not converge well, the penalization does not help much, even in the "sparsity" case, i.e., when only a few links/variables are effectively informative about y.

  • $\lambda$ is difficult to pick. Also, $\lambda$ should be specific or adaptive to each different weight in the neural network.

  • No clear theoretical guarantee compared to the linear models.

  • Hinton et al. (2012) point out that penalizing the max of the absolute values of the weights starting from the same neuron performs better.

  • This leads to Selecting Neural Network Structures in a data driven manner.

Can we do Network Structure Selection to improve the performance of Neural Networks?

  • Can we select which variables (in the input layer) to use? I.e., drop neurons in the input layer.

  • Can we select the structure of Neural Networks? I.e, drop/add neurons in the middles layers.

  • Can we select the links within the neural network? Non-fully-connected neural networks. CNNs have had great success in many applications, especially image recognition, by exploiting the correlation between neighboring pixels of an image.

  • Can all the works be done in an automated way (black box procedure to pick hyperparameters of Deep Neural Networks)?

An ongoing work of Luo and Liang 2016

Why is non-linear variable selection important?

  • Simplifies the structure of the neural network.

  • Can be applied repeatedly over the layers of the neural network.

  • Data-Adaptive Neural Network structure may have better generalizability and less overfitting.

Traditional Methods for non-linear variable selection

  • Additive model (Ravikumar et al., 2009): Each nonlinear feature is expressed as a basis function of a single variable. It fails to model interactions between variables.

  • Lin and Zhang (2006) encodes more complex interactions among the variables, e.g., the features defined by a second-degree polynomial kernel. The size of the dictionary grows more than exponentially as one considers high-order interactions.

  • Tree Based Approach: It makes the use of internals of the decision tree structure in variable selection. All observations begin in single root node and are split into two groups based on whether $x_k \ge c \text{ or } x_k < c$, where $x_k$ is a chosen splitting variable and c is a chosen splitting point.

$y=f(x)+\epsilon$: fully non-linear.

$y=f_1(x_1)+f_2(x_2)+...+f_p(x_p)+\epsilon$: partially linear (additive), with an $x_1x_2$ interaction term possibly added.

An ongoing work of Luo, Liang, Li and Yang 2016

  • Use Deep Neural Networks to approximate f .

  • Only penalize links in the first layer. This layer determines which variables are used in the model.

  • Deep neural networks gain a great benefit compared to shallow NNs: the first layer performs variable selection and the remaining layers perform universal approximation, whereas a shallow neural network has only one layer that must handle both variable selection and universal approximation.

  • When the SGD algorithm works well, the L1 and L0 penalties select the correct variables!

  • For the L0 penalty, we use Shen (2008)'s method: use a weighted L1 penalty to approximate the L0 penalty.

  • Our methods works well when n = 10000 and p = 100, s = 5.

  • Does not work well when n = 1000 and p = 2000 (p > n) case.

  • The SGD does not converge well in such a case.

  • The good thing of this procedure is that you can prove convergence rate under good optimization.

penalty on DNN

Selection with DNN

Simulation

Selection Results

Bayesian Neural Networks on Non-parametric variable selection

  • A similar work is done by Liang (2016). The approach adopts Bayesian Neural Networks.

  • Benefits: Avoid regularization and optimization.

  • Drawbacks: Does not apply for Deep Neural Network structures. Computation cost is high when model becomes too large.

  • Experiments: 995 noisy variables, sample size = 500. Shallow Neural Networks.

Examples

Bayesian Neural Networks for Variable Selection

Add an Additional Penalty on objective functions

  • Traditional prediction in time series: $x_{t+1} = f (x_t , x_{t−1}, ...)$, where f is learned directly from data.

  • Auto-regressions.

  • This does not work well when x itself is complicated.

  • Mathieu, Couprie and LeCun (2016), “DEEP MULTI-SCALE VIDEO PREDICTION BEYOND MEAN SQUARE ERROR”.

Additional Penalty on objective functions

  • Let Y = {Y1, ...,Yn} be a sequence of frames to predict from input frames X1, ...,Xm from a sequence of video images.

  • Usually we minimize $||Y − G(X)||_2^2$ to obtain some predictor G(x).

  • Adversarial training and adding a loss function based on the gradient difference (image gradient difference loss). So the true loss function to minimize is the original loss function plus the loss from adversarial training and the loss from the gradient difference.

$||Y − G(X)||_2^2+||\beta||^2 +||\nabla G(X)||$

Image Prediction by Penalized DNN

Random Dropout

  • Random Dropout is a very special type of regularization for Deep Learning.

  • Srivastava et. al. (2014): Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

  • So far the most successful way of preventing overfitting in Deep Learning. "Best" algorithm for optimizing a complex deep neural network. Good for fine-tuning.

  • Has a strong tie with Bayesian Model Averaging.

What is Random Dropout

  • In each epoch, randomly drop a few neurons in each layer: each neuron is retained with probability p (dropped with probability 1−p). At test time the neuron is always present, but its outgoing weights are multiplied by p (a small sketch follows this list).

  • Heuristic 1: the representation power of deep neural networks is so strong that even after dropping many neurons, the pattern can be well approximated.

  • Heuristic 2: there are many different ways for a neural network to represent the same function; the representation is non-unique.

  • Heuristic 3: random dropping performs model averaging, which leads to better generalizability and more robust results.

  • Heuristic 4: adding noise to neurons is helpful in characterizing patterns of the data.

Equivalent to averaging many networks.

Optimization in Dropout

  • Training many different architectures is hard and expensive; random dropout kills two birds with one stone.

  • Backpropagation after dropping neurons randomly.

  • Pretraining: (1) Using Stacked RBMs: Hinton and Salakhutdinov(2006). (2) Use Deep Boltzman Machines (Salakhutdinov and Hinton, 2009) . (3) Use autoencoders (Vincent et al., 2010).

  • Pretraining before dropout. Pretrained weights are multiplied by 1/p.

  • Fine tuning with dropout.

  • Very good performance in many different experiments.

Results from Srivastava et al. (2014)

Further Improvements in Pretraining

  • For p being very large, Stacked AutoEncoders are hard to train.

  • Idea: Split and Merge (Luo and Liang (2017)). Use many small AutoEncoders to replace a large AutoEncoder.

  • Step k: Partition the current variables $x_1^{(k)} , ..., x_{qk}^{(k)}$ into m groups. Perform AutoEncoder in each group.

  • Take the low-dimensional representation $w_j$, j = 1, 2, ..., m from each AutoEncoder, then combine these $w_j$'s into a new vector called $x^{(k+1)}$.

  • The number of variables reduces in an exponential way along with iteration.

Results from Srivastava et al. (2014)

Few neurons are activated: effectively a small neural network.

Early Stopping

  • For any step-wise algorithms, early stopping is an efficient way to avoid overfitting as well as saving computation.

  • An early stopping criterion is trying to judge whether the algorithm is fitting the pattern or fitting the error terms.

  • Does not require a validation set or cross-validation.

  • Not widely used in Deep Learning.

Mostly used with high-dimensional data.

Early stopping, by contrast, does not require computing so many P(γ).

Constrained Optimization/ Data Augmentation

  • Sometimes one can enforce some weights in the neural networks to be the same. Parameter sharing in Convolutional Neural Networks. This helps with reducing overfitting.

  • Data Augmentation. For some learning tasks, there are intuitions about which part of the data is informative or not. E.g., rotating the picture does not change the identity of what is described in the picture. So one can augment data by rotating the original pictures.

  • Sometimes the number of variables can be large. E.g., pixels in graphs.

  • Directly training the data with neural networks is costly.

  • Dimensionality reduction. Try to obtain features before the main training task.

  • AutoEncoders as a tool for Feature construction.

  • Vincent et al. (2008), Extracting and composing robust features with denoising autoencoders.

  • MNIST data. Compared with many other existing machine learning methods.

  • Use AutoEncoders to construct features, then fine tuning.

Results from Vincent et al. (2008)

SdA-3: 3-layer stacked autoencoders; DBN: deep belief network (multi-layer Boltzmann machines).

References

  • Srivastava et al., 2014, Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

  • Mathieu, Couprie and LeCun, 2016, Deep Multi-Scale Video Prediction Beyond Mean Square Error.

  • Vincent et al., 2008, Extracting and Composing Robust Features with Denoising Autoencoders.

  • Luo and Liang 2017, A split and merge algorithm for Deep AutoEncoders.

  • F. Liang, 2016, Bayesian Neural Networks for High-Dimensional Nonlinear Variable Selection with a Double Parallel Monte Carlo Algorithm.

  • Luo and Liang 2016, Nonparametric Variable Selection Via Penalized Neural Networks.

  • Kamper et. al., 2014, Unsupervised Neural Network Based Feature Extraction Using Weak Top-Down Constraints.

Day 4

Dynamic Demand Estimation

  • Goal: Build a demand model and construct the ”best” pricing policy function.

    • Observables: consumers' real responses to prices; control variables: time, weather, location, etc.

    • Build behavior model for consumers

    • What is the (distribution of) maximum price that consumers are willing to pay?

  • Complex environment and multiple factor, non-linearity.

  • What is the fallacy of traditional pricing methods (such as taxi meters)?

  • How to use data to construct complex pricing policies?

Problems

Consumers' decisions are affected by price as well as other factors, but the other factors are hard to take into account.

Consumer choice

  • Consumers’ input: call of service/ product.

  • E-platform: propose a price.

  • Consumer: decide to buy (D = 1) or not buy (D = 0).

  • Consumer’s decision: Buy with probability Pr(D = 1|Z, p).

Consumer behavior and probability function

$p_0(Z)$ is the predictable part; $(1+\epsilon)$ is the unpredictable part.

Methods

  • Linear/Fixed Base: $p_0(X) = X\beta$ , X as a set of transformations of Z.

  • Non-linear: $p_0(X)$ nonparametric; use a DNN to approximate it (universal approximation).

  • Fixed base: easy to optimize/understand, but worse performance than a DNN.

qualitative results

  • Willingness to pay is concave in distance.

  • Low conversion rate when the distance is far: prices are too high.

  • Weather is significantly affecting willingness to pay.

  • Non-linearity is important.

non-parametrics

  • 512 by 512 DNN.

  • Relu activation function.

  • SGD with dropout.

  • batch normalization.

First establish a baseline, then design the algorithm and check whether it improves on the baseline.

Pricing Optimization

Constraint: keep Q unchanged; Lagrangian term: $\lambda(Q-\tilde Q)$.

Basic Tasks with NLP

  • Relatively Easy Task: Spell Check/Grammar Check, Key Word Search(Search Engine), Finding Synonyms, segmentation (in Chinese), etc..

  • Medium Difficult Task: Parse a sentence, Information Extraction from certain documents, Sentiment Analysis, etc..

  • Hard tasks: machine translation, natural language understanding (semantic analysis), understanding polysemous words, coreference ("it", "she"), question answering systems (one-stage or multi-stage).

  • Core Difficulty: how to analyze information from unstructured text data?

Natural Language Processing: Before Deep Learning

  • Understanding sentences, paragraphs and text.

  • Before Deep Learning: k-gram HMM models, CRF models, etc..

  • Problems: Each word is atomic: Difficult to work with. Dimensionality too large.

  • There always exists unseen combinations, due to limitation of data. Requires labelling data in many different situations.

Natural Language Processing

  • From the viewpoint of statistics, the dimensionality of these language models is too large. For many methods based on k-grams: if k is too small, the model does not capture relationships between words that are far apart; if k is too large, the model's complexity is too high.

  • This leads to smoothing methods, but that is still quite undesirable.

  • Need other representations other than atomic representation - dimensionality reduction.

  • Need unsupervised machine learning to utilize the vast amount of texts.

Main Idea

  • Construct a word vector instead of using Atomic word representations.

  • But how to represent a word semantically? Or what is semantics?

  • If there is no labelled data, we can only observe the co-occurrence of words in the same sentences, and their relative positions.

  • Hypothesis: the neighbors of a word can represent the meaning of the word.

  • In Machine Learning, this is a clustering task.

why distributed representation?

  • Fighting curse of dimensionality.

  • Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence.

  • Learn simultaneously the word feature vectors and the parameters of that probability function.

Bag of words

  • Look at the sentence: "This morning, I ate a very delicious meal with my fork."

  • k-bag of words: find the bag of words that are within distance k of the word. For example, the neighbors of the word meal within distance 2 are {very, delicious, with, my} (a tiny sketch of window extraction follows this list).

  • Such a method has many drawbacks. First, the geographical neighbors do not necessarily have related semantic meaning.

  • Second, if we increase k, many unrelated words are included, for example, “this morning”.

  • Many other tricks. Downweighting the words that are far away, etc..

Early Methods: SVD

  • A very large co-occurrence matrix.

  • Assume that matrix is low rank. SVD decomposition.

  • Cutoff the eigenvectors at where the desired percentage of variance is achieved.

  • Does not work very well. The meanings of words are non-linear functions of neighbors rather than linear.

  • Drastic imbalance of words. Large matrices are difficult to work with.

  • Manually construct a forbidden-words list.

Continuous Bag of Words Models

Word2Vec

Mikolov’s Skip Grams.

  • “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)

  • Similar to the Continuous Bag of Words model, but maximizes the classification of a word based on another word in the same sentence.

  • Better performance in many tasks. Does not restrict to geographical neighbors; often the important and meaningful neighbors are identified.

  • Sampling probability $P(w) \propto U(w)^{\frac34}$: makes less frequent words be sampled more often (a small numerical sketch follows this list).

Performance of Skip Gram

Several Drawbacks of Distributed Word Representation

  • Some information is missing. E.g., the representation vectors of sell and buy often look the same: the bag of words does not account for the direction of neighbors.

  • Similarity is restricted to word level, but not phrase level. Does not understand operations.

  • Representations need to be re-trained if new words come in.

word representation in Chinese: A real example

  • Model is trained by 8M Tencent Weixin articles.

  • Model Skip-Gram.

  • Test Simple Single Word Classification using Word2vec.

  • Results are not quite satisfactory in the examples of the next page.

Performance of Skip Gram

Similar words to 微信 QQ 0.752506, 订阅号 0.714340 , QQ 号 0.695578 ,扫一扫 0.695488 ,微信公众号 0.694692, 私聊 0.681655 ,微信公众平台 0.674171, 私信 0.653821, 微信平台 0.651757, 官方 0.643621

Similar words to 中国 亚洲 0.653344, 大国 0.634759, 本国 0.632525, 全球 0.629047, 大陆 0.625511, 国内 0.620223, 亚太地区 0.612389, 美国 0.609993, 全世界 0.601425, 各国 0.588970

Improving Word to Vec?

  • Try to obtain a better representation than Skip-Gram.

  • Idea: use triples "subject-verb-object" to train word2vec. Avoid predicting and interacting with local nuisance words.

  • Alternative way of representation: Bayesian Networks.

  • Key question: how to incorporate Structured data, such as Knowledge graph, dictionary/ encyclopedia data into word representation.

Identify clusters of words

  • A data set (xi , yi ) with labellings yi .

  • $y_i$ can include: a word/word category, e.g., whether a word $x_i$ indicates a location, a name, etc. → Named Entity Recognition. Sentiment analysis: $y_i$ is an indicator for a good or bad word. Predictive models: predicting the words after $x_i$.

  • For the classifier, we can use simple methods like SVM or logistic regression, or we can use Neural Networks.

  • Use word2vec vectors as inputs.

  • With a sample of labelled data, train a classifier to detect the boundary of different groups.

  • Generalizability is key! When the labelled data is small, use the original general-purpose vector representation. Representation vectors will move around, so there is a danger of overfitting.

  • If the data set is large, you can train the vec to task.

Problems with Classifying one single word

  • Ambiguity is a big problem. The meaning of a word is only fixed in context.

  • "I ate the banana with a fork" and "I ate the banana with ketchup".

  • Use neighbors in the classification as well.

  • Neural Networks or Kernel SVM as classifiers are much better than linear models in these tasks.

Single Layer Recurrent Neural Networks

Recurrent Neural Networks

  • We can add additional layers and form a deep recurrent neural network.

  • The machine has memories over the past experience.

  • Can be bidirectional as well. x → y, y → x.

Application in Machine Translation

  • Source Language in English, Target Language, e.g., French.

  • Probabilistic formulation (Bayes' rule):

    $e^* = \arg\max_e p(e|f) = \arg\max_e p(f|e)p(e)$.

  • The language model p(e) is trained on an English corpus. The translation model p(f|e) is trained on a parallel corpus.

Traditional Way

  • Alignment: Matching words. E.g., poor → pauvres, the → Les.

  • Hard to work with, and there are many words or phrases that cannot be translated directly into another language. Non-linear alignment.

  • Positions and directions can be reversed. Alignment has a hard time dealing with these situations.

Translate Directly with RNN

Main Improvement

  • Memories over the past. Reduces ambiguity.

  • Automatic Alignment.

  • Phrase to Phrase translation, instead of word to word.

  • Even Modern Way: Long short term memory models (LSTM).

Bengio’s work on applying RNN on Machine Translation

  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014).

  • Performance increased by 2-4 percentage points.

  • Phrase to phrase translation.

Performance of Skip Gram

Traditional Ways to Learn time series models

Non-linear time series models and neural networks

  • If the model is $y_t = f (y_{t−1}, ...) + \epsilon_t$, where f is non-linear, traditional models fail to predict well: biased estimates.

  • Two mostly used neural networks: deep Boltzmann machines and stochastic neural networks.

  • Idea: Make the neurons being stochastic. Easy to train and approximate the time series better.

Applying Neural Networks to time series

Experiment

Performance of SNN

Sun Spot data

  • Sun spot observations from 1700-1979. Annual data.

  • Predict Sun spot activity in 1980-1987.

  • Better Performances over non-neural network based models.

Performance of SNN:Sun Spots predictions

References

  • Mikolov et. al., 2013, Distributed Representations of Words and Phrases and their Compositionality.

  • Cho, Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio, 2014, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.

  • Lai and Wong, 2001, Stochastic Neural Networks With Applications to Nonlinear Time Series.

A metamodel or surrogate model is a model of a model, i.e. a simplified model of an actual model of a circuit, system, or software-like entity. A metamodel can be a mathematical relation or algorithm representing input and output relations. A model is an abstraction of phenomena in the real world; a metamodel is yet another abstraction, highlighting properties of the model itself. A model conforms to its metamodel in the way that a computer program conforms to the grammar of the programming language in which it is written. Various types of metamodels include polynomial equations, neural networks, Kriging, etc. "Metamodeling" is the construction of a collection of "concepts" (things, terms, etc.) within a certain domain. Metamodeling typically involves studying the output and input relationships and then fitting the right metamodels to represent that behavior.