Slide 1: Supplement: Probabilistic Graphical Models and Topic Models
Slide 2: Course contents
- Chapter 1: Introduction
- Chapter 2: Boolean retrieval and the inverted index
- Chapter 3: The term vocabulary and postings lists
- Chapter 4: Index construction and index compression
- Chapter 5: The vector space model and retrieval systems
- Chapter 6: Evaluation of retrieval
- Chapter 7: Relevance feedback and query expansion
- Chapter 8: Probabilistic models
- Chapter 9: Language-model-based retrieval models
- Chapter 10: Text classification
- Chapter 11: Text clustering
- Supplement: Probabilistic graphical models and topic models
- Supplement: Overview of classic data mining algorithms
- Chapter 12: Web search
- Chapter 13: Multimedia information retrieval
- Chapter 14: A brief introduction to other applications
Slide 3: Upcoming schedule (tentative)
- Apr 29: Supplement: probabilistic graphical models and topic models
- May 6: Supplement: overview of classic data mining algorithms (1)
- May 8: Supplement: overview of classic data mining algorithms (2)
- May 13: Chapter 12, Web search
- May 15: Chapter 13, Multimedia information retrieval
- May 20: Review
- May 22: Student literature-reading reports
- May 27: Student literature-reading reports
- Jun 3: Final exam
Slide 4: Probabilistic Graphical Models / Topic Models: outline
- What is a graphical model: definition and examples; Representation, Inference, Learning
- Topic models and classification: LSA (Latent Semantic Analysis), 1990; pLSA (probabilistic Latent Semantic Analysis), 1999; LDA (Latent Dirichlet Allocation), 2003; hierarchical Bayesian models
- Topic-model implementation examples in R
Slide 5: Main references
- Luis Enrique Sucar, Probabilistic Graphical Models: Principles and Applications, 2015
- Daphne Koller & Nir Friedman, Probabilistic Graphical Models: Principles and Techniques, 2009
- Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006
Slide 6: Probabilistic Graphical Models / Topic Models: outline
- What is a graphical model: definition and examples; Representation, Inference, Learning
- Topic models and classification: LSA (Latent Semantic Analysis), 1990; pLSA (probabilistic Latent Semantic Analysis), 1999; LDA (Latent Dirichlet Allocation), 2003; hierarchical Bayesian models
- Topic-model implementation examples in R
Slide 7: Graphical models (Probabilistic Graphical Models, PGMs)
- A probabilistic graphical model is the general name for a family of models that use a graph to express probabilistic dependence relations.
- PGMs combine probability theory with graph theory, using a graph to represent the joint probability distribution over the model's variables.
- Over the past decade they have become a focus of research on reasoning under uncertainty, with broad applications in artificial intelligence, machine learning, and computer vision.
- PGM theory has three parts:
  - Representation: the representation theory of PGMs
  - Inference: the inference theory of PGMs
  - Learning: the learning theory of PGMs
Slide 8: Example of a probabilistic graph: the naive Bayes classifier
- Independence assumption: given the class, the terms of a document are conditionally independent.
- Bayes' rule: P(c|d) ∝ P(c) P(d|c) = P(c) ∏_{k=1..n_d} P(t_k|c)
- [Figure: graphical model with a class node c pointing to term nodes t_1, t_2, ..., t_{n_d}.]
Slide 9: Example of a probabilistic graph: representing a statistical model as a directed graph
- A data set of N points generated from a Gaussian, drawn in plate notation.
- (Source: Lei Zhang, Lead Researcher, Microsoft Research Asia, 2012-04-17, USTC)
Slide 10: Notation for probabilistic graphs
- Two equivalent forms: the unrolled graph and plate notation.
- Plate notation is a common shorthand for graphical models: a box (the "plate") encloses the part of the model that repeats, and the number of repetitions is written inside the box.
- Plate notation is very convenient for representing and analyzing many probabilistic models, but it has limitations. For example, it cannot express dependencies between different copies of the variables inside a plate, and such dependencies are common in dynamic Bayesian networks.
Slide 11: The naive Bayes model in plate notation
- Graph: class node c, with a plate over the word nodes w repeated N times.
- Model parameters: the parameter set that maximizes the posterior probability p(c|d).
- Parameter set: p(c_i), i = 1, 2, ..., number of classes; p(w_k|c), k = 1, 2, ..., number of terms.
- Decision rule: c* = argmax_c p(c|w) ∝ p(c) p(w|c) = p(c) ∏_{n=1..N} p(w_n|c), i.e., the object-class decision combines the prior probability of the classes with the likelihood of the observation given the class.
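As a concrete illustration of this decision rule, a minimal R sketch with made-up priors and term probabilities (the classes, vocabulary, and numbers are hypothetical, not from the slides); working in log space avoids numerical underflow:

```r
# Naive Bayes decision rule: c* = argmax_c  log p(c) + sum_n log p(w_n | c)
priors <- c(spam = 0.3, ham = 0.7)            # p(c), hypothetical values
cond <- rbind(                                # p(w | c): one row per class
  spam = c(buy = 0.4, now = 0.3, hello = 0.1, meeting = 0.1, free = 0.1),
  ham  = c(buy = 0.1, now = 0.1, hello = 0.3, meeting = 0.4, free = 0.1)
)
classify <- function(words) {
  scores <- log(priors) +
    sapply(rownames(cond), function(cl) sum(log(cond[cl, words])))
  names(which.max(scores))
}
classify(c("buy", "now", "free"))  # -> "spam"
classify(c("hello", "meeting"))    # -> "ham"
```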
Slide 12: Summary: what is a probabilistic graphical model?
- A graphical model (PGM) is the general name for a family of models that use a graph to express probabilistic dependence relations.
- Two ways to draw one: the unrolled graph and plate notation.
- Solving a PGM means optimization: estimating the parameters the graph defines (e.g., the parameters of the naive Bayes classifier).
Slide 13: Representation of PGMs
- Structure: G(V, E); parameters: CPTs.
- A Bayesian network (BN) represents the joint distribution of a set of n (discrete) variables, X1, X2, ..., Xn, as a directed acyclic graph (DAG) and a set of conditional probability tables (CPTs). Each node, which corresponds to a variable, has an associated CPT that contains the probability of each state of the variable given its parents in the graph. The structure of the network implies a set of conditional independence assertions, which give this representation its power.
- A PGM is specified by two aspects: (i) a graph, G(V, E), that defines the structure of the model; and (ii) a set of local functions, f(Yi), that define the parameters, where Yi is a subset of X. The joint probability is obtained by the product of the local functions: P(X1, ..., Xn) = (1/K) ∏_i f(Yi), where K is a normalization constant.
- This representation in terms of a graph and a set of local functions (called potentials) is the basis for inference and learning in PGMs.
- (Source: Luis Enrique Sucar, Probabilistic Graphical Models: Principles and Applications, Advances in Computer Vision and Pattern Recognition, 2015)
Slide 14: Inference in PGMs
- Example: a Bayesian network over five variables, representing the joint distribution P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|B) P(E|C,D).
- If we observe variable E = e (the evidence) and want the conditional probability P(c|e) that variable C = c, that computation is inference.
- Inference can be exact or approximate.
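Written out against this factorization (a worked form of the query, not shown on the slide), the inference marginalizes out the unobserved variables:

```latex
P(C=c \mid E=e) \;=\; \frac{P(c,e)}{P(e)}
  \;=\; \frac{\sum_{a,b,d} P(a)\,P(b\mid a)\,P(c\mid b)\,P(d\mid b)\,P(e\mid c,d)}
             {\sum_{c'}\sum_{a,b,d} P(a)\,P(b\mid a)\,P(c'\mid b)\,P(d\mid b)\,P(e\mid c',d)}
```

Exact algorithms (e.g., variable elimination) compute these sums by exploiting the factorization; approximate methods (e.g., sampling or variational inference) trade accuracy for speed on larger networks.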
Slide 15: Learning in PGMs
- If the structure is known, learning reduces to parameter estimation. Two common approaches: maximum likelihood estimation (MLE) and Bayesian estimation. The former treats the parameters as fixed values; the latter treats them as random variables.
- MLE: with complete data, parameter learning reduces to computing sufficient statistics; with incomplete data, the EM algorithm iteratively maximizes p(X|θ).
- Bayesian estimation: with complete data, different error criteria lead to maximum a posteriori (MAP) estimation or posterior-mean estimation; with incomplete data, θ can be treated as a special hidden variable, reducing the problem to inference, which can be solved approximately with variational Bayes.
- If the structure is unknown: with complete data, a good approach is to define a score function that measures how well a structure fits the data, then search for the highest-scoring structure. In practice, following Occam's razor, choose the simplest model that fits the data. If the structure is restricted to trees (each node has at most one parent), the search can be done in polynomial time; otherwise it is NP-hard. With incomplete data, structural EM is needed.
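As a concrete instance of the complete-data MLE case above, a minimal R sketch that estimates the CPT P(C|B) by counting; the simulated samples stand in for real training data:

```r
# MLE of a CPT from complete data: P(C = c | B = b) = N(b, c) / N(b).
set.seed(1)
n <- 1000
B <- sample(c("b0", "b1"), n, replace = TRUE, prob = c(0.4, 0.6))
C <- ifelse(B == "b0",
            sample(c("c0", "c1"), n, replace = TRUE, prob = c(0.8, 0.2)),
            sample(c("c0", "c1"), n, replace = TRUE, prob = c(0.3, 0.7)))
# The sufficient statistics are the joint counts; normalizing each row of the
# contingency table gives the conditional probability table.
cpt <- prop.table(table(B, C), margin = 1)
print(round(cpt, 2))  # rows approximate (0.8, 0.2) and (0.3, 0.7)
```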
Slide 16: Summary: Representation, Inference, Learning
- Representation: a graph (structure: G(V,E)) plus a set of local functions, called potentials (parameters: CPTs).
- Inference: answering probabilistic queries based on the model and some evidence; obtaining the marginal or conditional probabilities of any subset of variables Z given any other subset Y.
- Learning: given a set of data values for X (which can be incomplete), estimating the structure (graph) and parameters (local functions) of the model.
Slide 17: Common types of PGMs
- Directed acyclic graphs vs. undirected graphs.
- (Source: Luis Enrique Sucar, Probabilistic Graphical Models: Principles and Applications, Advances in Computer Vision and Pattern Recognition, 2015)
Slide 18: Common models
- Hidden Markov models
- Markov random fields
- Bayesian networks
- Decision graphs
- Markov decision processes
- Relational probabilistic graphical models
- Graphical causal models
Slide 19: Summary: what is a graphical model?
- Graphical models (probabilistic graphical models, PGMs); unrolled graphs vs. plate notation.
- PGM theory has three parts: Representation, Inference, Learning.
- Common types: Bayesian networks use directed acyclic graphs; Markov random fields use undirected graphs.
Slide 20: Probabilistic Graphical Models / Topic Models: outline
- What is a graphical model: definition and examples; Representation, Inference, Learning
- Topic models and classification: LSA (Latent Semantic Analysis), 1990; pLSA (probabilistic Latent Semantic Analysis), 1999; LDA (Latent Dirichlet Allocation), 2003; hierarchical Bayesian models
- Topic-model implementation examples in R
Slide 21: What is a topic model? The idea
- "We assume that some number of 'topics,' which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative; they are not fit from real data."
- (Source: D. M. Blei, "Probabilistic topic models," Communications of the ACM, 2012. Retrieved 2017-04-06; Google Scholar citations: 1622)
Slide 22: What is a topic model? An example
- "We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article" (the article shown on the previous slide). "At right are the top 15 most frequent words from the most frequent topics found in this article."
- (Source: D. M. Blei, "Probabilistic topic models," Communications of the ACM, 2012. Retrieved 2017-04-06; Google Scholar citations: 1622)
Slide 23: Probabilistic Graphical Models / Topic Models: outline
- What is a graphical model
- Topic models and classification: LSA (Latent Semantic Analysis), 1990; pLSA (probabilistic Latent Semantic Analysis), 1999; LDA (Latent Dirichlet Allocation), 2003; hierarchical Bayesian models
- Topic-model implementation examples in R
Slide 24: LSA (Latent Semantic Analysis)
- SVD of the term-document matrix to discover related documents.
- The document collection and the original term-document matrix.
- (Source: Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391-407.)
Slide 25: LSA (Latent Semantic Analysis)
- SVD of the term-document matrix to discover related documents: X = T0 S0 D0' (in the usual notation, C = U Σ V^T).
- Keep only the two largest singular values of S0 to obtain a low-rank approximation.
- (Source: Deerwester et al., 1990)
Slide 26: LSA (Latent Semantic Analysis)
- SVD of the term-document matrix to discover related documents: the original term-document matrix X.
- (Source: Deerwester et al., 1990)
Slide 27: Summary: Latent Semantic Analysis (LSA), 1990
- D = {d1, ..., dN}: N documents; W = {w1, ..., wM}: M words; Nij = #(di, wj): the N×M term-document co-occurrence matrix.
- Singular value decomposition: the N×M documents-by-words matrix factors as (N×K documents-by-topics) × (K×K) × (K×M topics-by-words).
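A minimal R sketch of LSA on a toy term-document matrix (the matrix, vocabulary, and K = 2 are illustrative); base R's svd() performs the decomposition, and cosine similarity in the latent space finds related documents:

```r
# LSA: truncated SVD of a term-document matrix (here terms x documents).
X <- matrix(c(1, 1, 0, 0,   # "human"
              1, 0, 1, 0,   # "interface"
              0, 1, 1, 0,   # "computer"
              0, 0, 1, 1,   # "system"
              0, 0, 0, 1),  # "graph"
            nrow = 5, byrow = TRUE,
            dimnames = list(c("human","interface","computer","system","graph"),
                            paste0("d", 1:4)))
s <- svd(X)
K <- 2                                                    # keep 2 singular values
X_k <- s$u[, 1:K] %*% diag(s$d[1:K]) %*% t(s$v[, 1:K])    # rank-K approximation
# Document vectors in the K-dimensional latent space: rows of V_K scaled by
# the singular values; cosine similarity between them finds related documents.
doc_vecs <- s$v[, 1:K] %*% diag(s$d[1:K])
norms <- sqrt(rowSums(doc_vecs^2))
round(doc_vecs %*% t(doc_vecs) / (norms %o% norms), 2)
```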
Slide 28: Probabilistic Graphical Models / Topic Models: outline
- What is a graphical model
- Topic models and classification: LSA (1990), pLSA (1999), LDA (2003), hierarchical Bayesian models
- Topic-model implementation examples in R
- Source for this section: Thomas Hofmann, "Probabilistic Latent Semantic Analysis," 1999 (University of California, Berkeley). Google Scholar citations as of 2016-04: 1873. https://scholar.google.com/citations?user=t3haylkaaaaj&hl=zh-cn (retrieved 2017-04-07)
Slide 29: Topic models
- Both pLSA and LDA are topic models. "In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract 'topics' that occur in a collection of documents."
- Basic ideas: (1) a document is a mixture over several topics; (2) each topic is a probability distribution over words.
Slide 30: pLSA (probabilistic Latent Semantic Analysis)
- [Plate diagram: document node d, latent topic nodes z_1, ..., z_N, observed word nodes w_1, ..., w_N; the plate repeats M times.]
- M: number of documents; N: number of terms in document d.
- z_1, ..., z_N are latent variables, with z_i ∈ {1, ..., K}, where K is the number of latent topics.
- (Source: Lei Zhang, Lead Researcher, Microsoft Research Asia, 2012-04-17, USTC)
Slide 31: pLSA (probabilistic Latent Semantic Analysis)
- n(d, w) denotes the number of times word w occurs in document d.
- [Figure: the unrolled model for documents d_1, ..., d_M, each with its own topic nodes z and word nodes w.]
- The topic-word distributions p(w|z = 1), p(w|z = 2), ..., p(w|z = K) are shared across all documents.
- Likelihood: L = ∏_d ∏_w p(d, w)^{n(d,w)}.
- (Source: Lei Zhang, Lead Researcher, Microsoft Research Asia, 2012-04-17, USTC)
Slide 32: Joint probability vs. likelihood
- n(d, w) denotes the number of times w occurs in d.
- Joint probability: p(d, w) = p(d) Σ_z p(z|d) p(w|z).
- Likelihood (over the observed variables only): L = ∏_d ∏_w p(d, w)^{n(d,w)}.
- p(d) is assumed to be uniform.
Slide 33: The pLSA objective function
- pLSA tries to maximize the log-likelihood: ℓ = Σ_d Σ_w n(d, w) log p(d, w) = Σ_d Σ_w n(d, w) log [ p(d) Σ_z p(z|d) p(w|z) ].
- Because of the summation over z inside the log, we have to resort to EM.
Slide 34: The Expectation-Maximization (EM) algorithm
- The EM algorithm is a method for maximum-likelihood learning of parameters in latent-variable models.
- E-step: compute the posterior probabilities of the hidden variables given the current parameter estimates (i.e., the expectation of the likelihood under the current parameters).
- M-step: update the parameters using the posteriors just computed (i.e., choose the parameters that maximize the expected likelihood).
Slide 35: pLSA EM steps
- E-step (posterior of the latent topic under the current parameters): p(z|d, w) = p(z|d) p(w|z) / Σ_{z'} p(z'|d) p(w|z').
- M-step (parameters that maximize the expected log-likelihood): p(w|z) ∝ Σ_d n(d, w) p(z|d, w); p(z|d) ∝ Σ_w n(d, w) p(z|d, w).
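A bare-bones R sketch of this EM loop on a toy count matrix (random initialization; the counts and K = 2 are illustrative, and this is a didactic sketch rather than Hofmann's implementation):

```r
# pLSA via EM on an (n_docs x n_words) count matrix n(d, w).
set.seed(42)
n <- matrix(rpois(6 * 8, lambda = 3), nrow = 6)  # toy counts: 6 docs, 8 words
K <- 2
n_d <- nrow(n); n_w <- ncol(n)
p_z_d <- matrix(runif(n_d * K), n_d, K)          # p(z | d), rows sum to 1
p_z_d <- p_z_d / rowSums(p_z_d)
p_w_z <- matrix(runif(K * n_w), K, n_w)          # p(w | z), rows sum to 1
p_w_z <- p_w_z / rowSums(p_w_z)

for (iter in 1:100) {
  # E-step: responsibilities q[d, w, z] = p(z | d, w)
  q <- array(0, dim = c(n_d, n_w, K))
  for (z in 1:K) q[, , z] <- outer(p_z_d[, z], p_w_z[z, ])
  norm <- apply(q, c(1, 2), sum)                 # normalizer: sum over z
  for (z in 1:K) q[, , z] <- q[, , z] / pmax(norm, 1e-12)
  # M-step: re-estimate p(w | z) and p(z | d) from expected counts
  for (z in 1:K) {
    nz <- n * q[, , z]
    p_w_z[z, ] <- colSums(nz) / sum(nz)
    p_z_d[, z] <- rowSums(nz)
  }
  p_z_d <- p_z_d / rowSums(p_z_d)
}
round(p_w_z, 3)  # topic-word distributions
round(p_z_d, 3)  # document-topic mixtures
```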
Slide 36: pLSA vs. LSA
- Each document can be decomposed as p(w|d) = Σ_z p(w|z) p(z|d); the dominant topic of a document is argmax_z p(z|d).
- This is similar to the LSA matrix decomposition: the document-word matrix factors into a document-topic matrix times a topic-word matrix, with K latent dimensions.
Slide 37: pLSA vs. LSA
- Both LSA and pLSA perform dimensionality reduction: LSA by keeping only K singular values, pLSA by having K aspects.
- The main difference is the way the approximation is done: pLSA generates a model (the aspect model) and maximizes its predictive power.
- Selecting the proper value of K is heuristic in LSA; in pLSA, statistical model selection can determine the optimal K.
Slide 38: pLSA for image classification
- [Plate diagram: image d, topic z, visual word w, repeated N times over D images; example category: face.]
- The same model applies with images as documents and quantized local features (visual words) as words.
Slide 39: pLSA application: scene classification
- (Source: Bosch, A., Zisserman, A., & Munoz, X., "Scene Classification via pLSA," ECCV 2006. Google Scholar citations as of 2016-04: 742)
Slide 41: Classification results
- (Source: Bosch, A., Zisserman, A., & Munoz, X., "Scene Classification via pLSA," ECCV 2006. Google Scholar citations as of 2016-04: 742)
Slide 42: Summary: pLSA
- Topic model: (1) a document is a mixture over several topics; (2) each topic is a probability distribution over words.
- [Plate diagram of pLSA: d, then z, then w, with inner plate N (words) and outer plate M (documents).]
Slide 43: Summary: how to generate M documents of N words each
- Three document-generation models: (a) unigram: no topics; (b) mixture of unigrams: one topic per document; (c) pLSA: a document may contain multiple topics.
Slide 44: Probabilistic Graphical Models / Topic Models: outline
- What is a graphical model: definition and examples; Representation, Inference, Learning
- Topic models and classification: LSA (1990), pLSA (1999), LDA (2003), hierarchical Bayesian models
- Topic-model implementation examples in R
- Source for this section: David M. Blei, Andrew Y. Ng, Michael I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, 2003. Google Scholar citations as of 2016-04: 14,167.
Slide 45: Andrew Ng (吴恩达, b. 1976), Chinese-American
- Born in the UK in 1976; grew up in Hong Kong and Singapore; graduated from Raffles Institution, Singapore, in 1992.
- B.S. in computer science from Carnegie Mellon University, 1997; M.S. from MIT, 1998; Ph.D. from UC Berkeley, 2002.
- From September 2002: associate professor in the Computer Science and Electrical Engineering departments at Stanford University, and director of the Stanford AI Lab.
- January 2011 to June 2012: founded and led Google's deep-learning project.
- Since January 2012: co-founder of Coursera.
- May 16, 2014: joined Baidu as chief scientist, leading Baidu Research, in particular the Baidu Brain project.
- March 22, 2017: announced in an open letter that he was leaving Baidu to start a new chapter in AI.
- (Source: Blei, Ng, & Jordan, "Latent Dirichlet Allocation," JMLR 2003. Google Scholar citations as of 2016-04: 14,167)
Slide 46: Problems with pLSA
- pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions.
- The number of parameters in the model grows linearly with M (the number of documents in the training set).
Slide 47: Problems with pLSA
- There is no constraint on the distributions p(z|d_i): each of p(z|d_1), p(z|d_2), ..., p(z|d_M) is a free parameter.
- [Figure: the unrolled model for documents d_1, ..., d_M with their topic and word nodes.]
- This easily leads to serious over-fitting.
Slide 48: The LDA model
- [Figure: three documents, each with topic assignments z_1, ..., z_4 and words w_1, ..., w_4.]
- For each document: choose θ ~ Dirichlet(α). For each of the N words w_n: choose a topic z_n ~ Multinomial(θ), then choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
Slide 50: The LDA model: how is a document generated?
- For each document: choose θ ~ p(θ|α) = Dirichlet(α).
- For each of the N words w_n: choose a topic z_n ~ p(z|θ) = Multinomial(θ); choose a word w_n ~ p(w|z), i.e., from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
Slide 51: LDA's document-generation model
1. Corpus level (red): α and β are corpus-level parameters, the same for every document, so they are sampled only once in the generation process.
2. Document level (orange): θ is a document-level parameter; every document has its own θ, i.e., its own probability of producing each topic z, so θ is sampled once per document.
3. Word level (green): z and w are word-level variables; z is drawn from θ, and then z and β together produce w; each w corresponds to exactly one z.
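A small R simulation of this generative story (the vocabulary, α, β, and document length are made up; the Dirichlet draw is built from rgamma so the sketch stays in base R):

```r
# Simulate LDA: theta ~ Dirichlet(alpha); z_n ~ Mult(theta); w_n ~ Mult(beta[z_n, ]).
set.seed(7)
vocab <- c("gene", "dna", "cell", "ball", "team", "score")
K <- 2
alpha <- rep(0.5, K)
beta <- rbind(c(.40, .30, .25, .02, .02, .01),  # topic 1: biology-flavored
              c(.02, .02, .01, .35, .30, .30))  # topic 2: sports-flavored
rdirichlet1 <- function(a) { g <- rgamma(length(a), shape = a); g / sum(g) }

gen_doc <- function(N = 10) {
  theta <- rdirichlet1(alpha)                        # document-level topic mixture
  z <- sample(1:K, N, replace = TRUE, prob = theta)  # word-level topic assignments
  w <- sapply(z, function(k) sample(vocab, 1, prob = beta[k, ]))
  list(theta = round(theta, 2), words = w)
}
gen_doc()  # one simulated document with its (hidden) theta
```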
Slide 52: A geometric interpretation
- "The mixture of unigrams places each document at one of the corners of the topic simplex. The pLSI model induces an empirical distribution on the topic simplex denoted by x. LDA places a smooth distribution on the topic simplex denoted by the contour lines."
- (Source: Blei, Ng, & Jordan, "Latent Dirichlet Allocation," JMLR 2003. Google Scholar citations as of 2016-04: 14,167)
Slide 53: Joint probability
- Given the parameters α and β: p(θ, z, w | α, β) = p(θ|α) ∏_{n=1..N} p(z_n|θ) p(w_n | z_n, β), where p(z_n|θ) is simply θ_i for the unique i such that z_n^i = 1.
Slide 54: Likelihood
- Joint probability: p(θ, z, w | α, β) = p(θ|α) ∏_{n=1..N} p(z_n|θ) p(w_n | z_n, β).
- Marginal distribution of a document: p(w | α, β) = ∫ p(θ|α) [ ∏_{n=1..N} Σ_{z_n} p(z_n|θ) p(w_n | z_n, β) ] dθ.
- Likelihood over all the documents: p(D | α, β) = ∏_{d=1..M} p(w_d | α, β).
Slide 55: An LDA analysis of the CVPR 2015 papers
- http://www-cs-faculty.stanford.edu/people/karpathy/cvpr2015papers/
Slide 56: Summary: LDA
- Latent Dirichlet allocation (LDA) is a generative probabilistic model; specifically, a three-level hierarchical Bayesian model.
- θ ~ p(θ|α) = Dirichlet(α); topic z_n ~ p(z|θ) = Multinomial(θ); word w_n ~ p(w|z), from p(w_n | z_n, β), a multinomial.
Slide 57: Limitations of LDA
- The bag-of-words assumption.
- The topic-word distributions do not change over time.
- The number of topics is assumed known and fixed.
- Correlations between topics are ignored.
- (Generative process, for reference: for each document, choose θ ~ p(θ|α) = Dirichlet(α); for each of the N words w_n, choose a topic z_n ~ p(z|θ) = Multinomial(θ), then a word w_n ~ p(w|z), from p(w_n | z_n, β).)
- (Source: D. M. Blei, "Probabilistic topic models," Communications of the ACM, 2012. Retrieved 2017-04-06; Google Scholar citations: 1622)
Slide 58: Improvements on LDA
- http://www.cs.columbia.edu/~blei/
- Dynamic topic models: let the corpus topics change over time.
- Correlated topic models: capture correlations between topics by replacing the Dirichlet with a logistic normal distribution.
- https://scholar.google.com.hk/citations?user=8oye6ieaaaaj&hl=zh-cn (retrieved 2017-04-06)
Slide 59: Probabilistic Graphical Models / Topic Models: outline
- What is a graphical model: definition and examples; Representation, Inference, Learning
- Topic models and classification: LSA (1990), pLSA (1999), LDA (2003), hierarchical Bayesian models
- Topic-model implementation examples in R
- Source for this section: Li Fei-Fei & Pietro Perona, "A Bayesian hierarchical model for learning natural scene categories," CVPR 2005. Google Scholar citations as of 2016-04: 2942.
Slide 60: Fei-Fei Li
- https://scholar.google.com/citations?user=rdfyqniaaaaj&hl=zh-cn (retrieved 2017-04-07)
- In November 2016, Google announced that Fei-Fei Li was joining its cloud team.
- Born in Beijing, raised in Sichuan; moved to the United States with her parents at age 16.
- Tenured professor in the Computer Science Department at Stanford University; director of the Stanford AI Lab and the Stanford Vision Lab.
- Main research areas: machine learning, computer vision, and cognitive computational neuroscience, with an emphasis on big-data analysis.
- B.A. from Princeton University, 1999; Ph.D. in electrical engineering from Caltech, 2005. Joined Stanford as an assistant professor in 2009 and became an associate professor with tenure in 2012; previously at Princeton (2007-2009) and UIUC (2005-2006).
- Speaker at TED 2015.
Slide 62: From text classification to image classification
Slide 63: Hierarchical Bayesian text models
- Latent Dirichlet Allocation (LDA), applied here to scene categories such as "beach".
- [Plate diagram: category c, topic z, word w, over D documents and N words.]
Slide 64: Codebook
- "A codebook obtained from 650 training examples from all 13 categories (50 images from each category). Image patches are detected by a sliding grid and random sampling of scales. The codewords are sorted in descending order according to the size of their membership. Interestingly, most of the codewords appear to represent simple orientations and illumination patterns, similar to the ones that the early human visual system responds to."
Slide 65: Theme model for scene categorization
- Generative process: c ~ p(c|η), a multinomial(η); π ~ p(π|c, θ), a Dirichlet(θ); z_n ~ multinomial(π); x_n ~ p(x_n | z_n, β).
- (a) Theme Model 1 for scene categorization shares both the intermediate-level themes and the feature-level codewords. (b) Theme Model 2 shares only the feature-level codewords. (c) The traditional texton model.
- (Source: Li Fei-Fei & Pietro Perona, "A Bayesian hierarchical model for learning natural scene categories," CVPR 2005. Google Scholar citations as of 2016-04: 1942)
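Reading the plate diagram as a joint distribution, this transcription of the generative process (mine, following the formulation above rather than copied from the slide) may help:

```latex
p(\mathbf{x}, \mathbf{z}, \pi, c \mid \eta, \theta, \beta)
  \;=\; p(c \mid \eta)\; p(\pi \mid c, \theta)
        \prod_{n=1}^{N} p(z_n \mid \pi)\, p(x_n \mid z_n, \beta)
```

Compared with plain LDA, the extra top level c ties each image's theme mixture π to its scene category, which is what makes the model usable for categorization.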
Slide 66: Topic distributions in different categories
- "Internal structure of the models learnt for each category. Each row represents one category. The left panel shows the distribution of the 40 intermediate themes. The right panel shows the distribution of codewords as well as the appearance of 10 codewords selected from the top 20 most likely codewords for this category model."
Slide 67: Hierarchical clustering of the topics
- "Dendrogram of the relationship of the 13 category models based on theme distribution. The y-axis is the pseudo-Euclidean distance measure between models."
Slide 68: More topic models
- Hierarchical Dirichlet Processes, Journal of the American Statistical Association, 2006
- Correlated Topic Models, NIPS 2005
- Dynamic Topic Models, ICML 2006
- Nonparametric Bayes Pachinko Allocation, UAI 2007
- Supervised LDA, NIPS 2007
- MedLDA: Maximum Margin Discriminant LDA, ICML 2009
- Online Learning for Latent Dirichlet Allocation, NIPS 2010
- Hierarchically Supervised Latent Dirichlet Allocation, NIPS 2011
- A Spectral Algorithm for Latent Dirichlet Allocation, NIPS 2012
- TopicRNN (combining RNNs and topic models), ICLR 2017
- Autoencoding Variational Inference for Topic Models, ICLR 2017
- Neural Relational Topic Models for Scientific Article Analysis, CIKM 2018
Slide 69: Summary: topic models and classification
- LSA (Latent Semantic Analysis), 1990
- pLSA (probabilistic Latent Semantic Analysis), 1999
- LDA (Latent Dirichlet Allocation), 2003
- Hierarchical Bayesian models (e.g., the scene-categorization model above), 2005
Slide 70: Probabilistic Graphical Models / Topic Models: outline
- What is a graphical model: definition and examples; Representation, Inference, Learning
- Topic models and classification: LSA (1990), pLSA (1999), LDA (2003), hierarchical Bayesian models
- Topic-model implementation examples in R
Slide 71: Topic models in R: examples
- Two R packages provide LDA: lda and topicmodels.
- lda: classic LDA with Gibbs sampling, MMSB (the mixed-membership stochastic blockmodel), RTM (the relational topic model), and VEM-based (variational expectation-maximization) sLDA (supervised LDA).
- topicmodels: built on the tm package; provides three models: LDA_VEM, LDA_Gibbs, and CTM_VEM (the correlated topic model).
- Visualization: the LDAvis package.
- http://blog.csdn.net/sinat_26917383/article/details/51547298 (retrieved 2017-04-06)
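A minimal, self-contained sketch using topicmodels (the AssociatedPress document-term matrix ships with the package; k = 5, the seed, and the iteration count are arbitrary choices):

```r
# Fit LDA by Gibbs sampling with the topicmodels package.
library(topicmodels)
data("AssociatedPress", package = "topicmodels")  # document-term matrix (tm format)
dtm <- AssociatedPress[1:200, ]                   # small slice to keep the demo fast
fit <- LDA(dtm, k = 5, method = "Gibbs",
           control = list(seed = 1, iter = 500))
terms(fit, 8)           # top 8 terms for each topic
head(topics(fit))       # most likely topic for each document
post <- posterior(fit)  # post$topics: doc-topic mixtures; post$terms: topic-word
```

Swapping method = "Gibbs" for "VEM", or LDA for CTM, selects the other estimators the slide lists.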
Slide 72: Word frequencies in Song ci poetry
- From the blog post "东风夜放花千树: A first exploration of topic modeling on Song ci poems": http://chengjunwang.com/cn/2013/09/topic-modeling-of-song-peom/
Slide 73: Clustering words by their co-occurrence probabilities
- Clustering results for words longer than one character, as shown in the figure.
- (Source: "东风夜放花千树," http://chengjunwang.com/cn/2013/09/topic-modeling-of-song-peom/)
Slide 74: Clustering words by their co-occurrence probabilities
- Clustering results for words longer than two characters, as shown in the figure. Evidently Song ci really does dwell on romantic dash (风流倜傥); even the clusters have to do with the wind.
- (Source: "东风夜放花千树," http://chengjunwang.com/cn/2013/09/topic-modeling-of-song-peom/)
Slide 75: Clustering words by their co-occurrence probabilities
- Clustering results for words longer than three characters, as shown in the figure.
- (Source: "东风夜放花千树," http://chengjunwang.com/cn/2013/09/topic-modeling-of-song-peom/)
Slide 76: Topic network graphs
- The topicmodels R package, contributed by Bettina Grün and Kurt Hornik, currently supports four estimation methods: VEM (variational expectation-maximization), VEM with fixed alpha, Gibbs sampling, and CTM (the correlated topic model). For details, see their paper; for more background on topic models, see Blei's articles.
- [Figures: topic networks estimated with Gibbs sampling and with CTM.]
- (Source: "东风夜放花千树," http://chengjunwang.com/cn/2013/09/topic-modeling-of-song-peom/)
Slide 77: Summary: topic models in R
- Existing R packages: lda and topicmodels.
- Visualization: the LDAvis package.
Slide 78: Probabilistic Graphical Models / Topic Models: outline
- What is a graphical model: definition and examples; Representation, Inference, Learning
- Topic models and classification: LSA (Latent Semantic Analysis), 1990; pLSA (probabilistic Latent Semantic Analysis), 1999; LDA (Latent Dirichlet Allocation), 2003; hierarchical Bayesian models
- Topic-model implementation examples in R