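摘要 (Abstract)

With the rapid growth and spread of the Internet, the volume of online text has increased sharply. Organizing and managing this mass of information effectively, and retrieving what users need quickly, accurately, and comprehensively, is a major challenge for information science and technology today. Text classification, a key technology for processing and organizing large amounts of text data, can largely resolve the problem of information disorder and helps users locate the information they need and route information streams. As the foundation of information filtering, information retrieval, search engines, text databases, and digital libraries, text classification research has both theoretical significance and broad practical value.

However, most current work classifies on the text body alone and ignores what the title, keywords, abstract, and other elements of an online text contribute to its category. How to combine these elements into an efficient and accurate classification algorithm is one of the main research topics of this thesis.

This thesis proposes a multi-element Chinese text classification algorithm that classifies each element of a text with the KNN algorithm, combines the resulting classifiers using Bayes' theorem, and finally uses simulated annealing to coordinate the weight of each element. Experiments show that the algorithm handles multi-element text classification effectively and achieves higher classification accuracy than traditional text classification methods.

The KNN-based multi-element Chinese text classification coordination algorithm consists of three parts:

Study and implementation of Chinese text classification based on the KNN algorithm. We study how different feature dimensions and different feature selection algorithms affect the classifier, and how the KNN algorithm performs for different values of K. Based on the experimental results, we choose the best feature dimension, feature selection algorithm, and K value to build a classifier for each element, use these classifiers to obtain class information for the test set, and then evaluate the classification results.

A multi-classifier coordination algorithm based on Bayes' theorem. The algorithm applies Bayes' theorem to the coordination of multiple classifiers: from each classifier's result and its measured classification performance, it recalculates the probability that a text belongs to each class.

Application of simulated annealing to multi-element text classification. Because the elements of a multi-element text contribute differently to its category, we apply simulated annealing to coordinate the weights of the elements, and we show by experiment that the approach is feasible and effective.

Keywords: KNN algorithm; multi-element; text classification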
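The first component described in the abstract, a KNN classifier applied to one element of a text, can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the cosine similarity measure, the raw term-count vectors, the similarity-weighted vote, and the names `cosine`, `knn_scores`, and `knn_classify` are all assumptions made for the example, and the toy English tokens stand in for segmented Chinese words.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dict-like)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(query, training, k):
    """Score each class by the summed similarity of the k nearest neighbours.

    Scores are normalised to sum to 1 so they can be read as rough
    class probabilities (an assumption of this sketch).
    """
    ranked = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, label in ranked:
        scores[label] += sim
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else {}


def knn_classify(query, training, k=3):
    """Return the highest-scoring class, or None if nothing is similar."""
    scores = knn_scores(query, training, k)
    return max(scores, key=scores.get) if scores else None


# Hypothetical toy data: one (term-count vector, label) pair per training text.
training = [
    (Counter("stock market shares price".split()), "finance"),
    (Counter("bank interest market loan".split()), "finance"),
    (Counter("football match goal team".split()), "sports"),
    (Counter("team coach match season".split()), "sports"),
]
query = Counter("market price loan".split())
predicted = knn_classify(query, training, k=3)
```

In a multi-element setting, one such classifier would be trained per element (title, keywords, abstract, body), each with its own feature set and K value.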

Abstract

With the rapid development and spread of the Internet, the amount of text on the Internet is growing rapidly. Organizing and managing this information effectively, and retrieving what users need quickly, accurately, and comprehensively, is a major challenge for information science and technology. Text classification, a key technology for organizing and processing large amounts of text data, can reduce information disorder and help users locate the information they need. Moreover, text classification underlies information filtering, information retrieval, search engines, text databases, and digital libraries, so its study has both theoretical significance and practical value.

However, current text classification typically considers only the text body and ignores the contribution of the title, keywords, abstract, and similar elements. How to use all of this information in an efficient and accurate classification algorithm is therefore the subject of this thesis. We propose classifying each element of a text with the KNN algorithm, combining the resulting classifiers with Bayes' theorem, and then coordinating the weights of the different elements with simulated annealing. Experimental results show that the algorithm solves multi-element text classification problems effectively and achieves higher accuracy than traditional algorithms.

The KNN-based multi-element text classification coordination algorithm has three parts.

Study and implementation of Chinese text classification based on KNN. We studied how different feature selection methods affect the classifier and how classification performance varies with K. Based on the experiments, we chose the best feature selection method and K value for KNN classification, and we evaluated the classification results.

A multi-classifier coordination algorithm based on Bayes' theorem. The algorithm applies Bayes' theorem to the coordination of multiple classifiers: using each classifier's result and its measured performance, it recalculates the class of the text.

Application of simulated annealing to multi-element text classification. Because each element of a multi-element text contributes differently to the classification, we apply simulated annealing to coordinate the weights of the elements. Experiments validate the feasibility and effectiveness of the method.

Keywords: KNN algorithm; multiple elements; text categorization
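The second and third parts of the abstract (Bayes-based coordination of the per-element classifiers, and simulated annealing over the element weights) can be sketched together. The abstract does not give the formulas, so everything below is an assumption chosen for illustration: the posterior is a weighted naive-Bayes product P(c) · Π P(pred_e | c)^w_e, the likelihood tables stand in for each classifier's measured performance, the annealing objective is accuracy on a held-out sample, and all names and numbers are hypothetical.

```python
import math
import random


def posterior(predictions, likelihood, prior, weights):
    """Fuse per-element predictions into class probabilities (log-space).

    predictions: {element: predicted_label}
    likelihood:  {element: {true_class: {predicted_label: P(pred | true)}}}
    """
    log_post = {}
    for c, p in prior.items():
        lp = math.log(p)
        for elem, pred in predictions.items():
            lp += weights[elem] * math.log(likelihood[elem][c].get(pred, 1e-6))
        log_post[c] = lp
    top = max(log_post.values())
    unnorm = {c: math.exp(v - top) for c, v in log_post.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def accuracy(samples, likelihood, prior, weights):
    """Fraction of samples whose most probable fused class is correct."""
    hits = 0
    for predictions, truth in samples:
        post = posterior(predictions, likelihood, prior, weights)
        hits += max(post, key=post.get) == truth
    return hits / len(samples)


def anneal_weights(samples, likelihood, prior, elements,
                   steps=500, t0=1.0, cooling=0.99, seed=0):
    """Simulated annealing over element weights, maximising accuracy."""
    rng = random.Random(seed)
    current = {e: 1.0 for e in elements}
    cur_score = accuracy(samples, likelihood, prior, current)
    best, best_score = dict(current), cur_score
    t = t0
    for _ in range(steps):
        cand = dict(current)
        e = rng.choice(elements)
        cand[e] = max(0.0, cand[e] + rng.uniform(-0.3, 0.3))
        score = accuracy(samples, likelihood, prior, cand)
        # Metropolis rule: always accept improvements or ties, and accept
        # worse candidates with a probability that shrinks as t cools.
        if score >= cur_score or rng.random() < math.exp((score - cur_score) / t):
            current, cur_score = cand, score
            if cur_score > best_score:
                best, best_score = dict(current), cur_score
        t *= cooling
    return best, best_score


# Hypothetical held-out data: per-element predictions and the true class.
elements = ["title", "body"]
prior = {"A": 0.5, "B": 0.5}
likelihood = {
    "title": {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}},
    "body": {"A": {"A": 0.95, "B": 0.05}, "B": {"A": 0.05, "B": 0.95}},
}
samples = [
    ({"title": "A", "body": "B"}, "A"),
    ({"title": "A", "body": "A"}, "A"),
    ({"title": "B", "body": "A"}, "B"),
    ({"title": "B", "body": "B"}, "B"),
    ({"title": "A", "body": "B"}, "A"),
    ({"title": "B", "body": "A"}, "B"),
]
baseline = accuracy(samples, likelihood, prior, {e: 1.0 for e in elements})
best_weights, best_score = anneal_weights(samples, likelihood, prior, elements)
```

Because the search keeps the best weights seen so far, starting from equal weights, `best_score` can never fall below `baseline`. How much it improves depends on the data and the cooling schedule.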

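目录 (Table of Contents)

Chapter 1 Introduction
1.1 Research background and significance
1.2 Research content
1.3 Research status in China and abroad
1.3.1 Research status abroad
1.3.2 Research status in China
1.4 Main work of this thesis

Chapter 2 Text classification technologies
2.1 How a text classification system works
2.2 Text preprocessing
2.2.1 Characteristics of Chinese text classification
2.2.2 Document sets
2.2.3 Document representation models
2.2.4 Chinese word segmentation
2.2.5 Chinese stop-word handling
2.3 Text feature selection methods
2.3.1 Information Gain
2.3.2 Mutual Information
2.3.3 χ² statistic
2.3.4 Cross Entropy
2.3.5 Document Frequency
2.4 Feature weighting algorithms
2.4.1 Boolean Weighting
2.4.2 Term-frequency weighting
2.4.3 TFIDF weighting
2.5 Statistical classification algorithms
2.5.1 Class center vector algorithm
2.5.2 Naive Bayes algorithm
2.5.3 Support vector machine (SVM) classification algorithm
2.5.4 K-nearest neighbors (KNN) algorithm
2.6 Classification performance evaluation
2.6.1 Single-class assignment
2.6.2 Multi-class ranking
2.7 Chapter summary

Chapter 3 KNN-based multi-element Chinese text classification coordination algorithm
3.1 Building a KNN-based Chinese text classification system
3.1.1 Training phase
3.1.2 Testing phase
3.1.3 Classifier evaluation phase
3.2 Multi-classifier coordination algorithm based on Bayes' theorem
3.2.1 Background on Bayes theory
3.2.2 Coordinating multiple classification results with Bayes' theorem
3.3 Applying simulated annealing to multi-element text classification
3.3.1 Simulated annealing
3.3.2 Applying simulated annealing to multi-element text classification
3.4 Chapter summary

Chapter 4 Experimental results and analysis
4.1 Corpus description
4.2 Text classification algorithm
4.2.1 Effect of the number of features on classification
4.2.2 Effect of the feature selection algorithm on classification
4.2.3 Effect of the K value in KNN on classification
4.3 Classification performance of the KNN-based multi-element algorithm
4.3.1 KNN classification results on multi-element Chinese text
4.3.2 Coordinated multi-element algorithm compared with traditional methods
4.4 Chapter summary

Chapter 5 Conclusion
5.1 Summary
5.2 Future work

References
Papers published during the master's program
Acknowledgements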

Contents

Chapter 1 Introduction
1.1 Background and Significance
1.2 Problem Description
1.3 Research Status
1.3.1 Research Status Abroad
1.3.2 Research Status in China
1.4 Main Work

Chapter 2 Text Categorization Technology
2.1 Text Categorization's Characteristics
2.2 Text Preprocessing
2.2.1 Chinese Text Categorization's Characteristics
2.2.2 Text Set
2.2.3 Text Representation Model
2.2.4 Word Segmentation
2.2.5 Stop Words
2.3 Feature Selection Algorithms
2.3.1 Information Gain
2.3.2 Mutual Information
2.3.3 χ² Statistics
2.3.4 Cross Entropy
2.3.5 Document Frequency
2.4 Feature Weighting Algorithms
2.4.1 Boolean Weighting
2.4.2 Word Frequency Weighting
2.4.3 TFIDF Weighting
2.5 Text Categorization Algorithms Based on Statistics
2.5.1 Class Center Vector Algorithm
2.5.2 Naive Bayes Algorithm
2.5.3 SVM
2.5.4 KNN
2.6 Evaluation
2.6.1 Single-Class Assignment
2.6.2 Multi-Class Ranking
2.7 Summary of This Chapter

Chapter 3 Multiple-Element Text Categorization Coordination Algorithm Based on KNN
3.1 Multiple-Element Text Categorization Algorithm Based on KNN
3.1.1 The Training Phase
3.1.2 The Testing Phase
3.1.3 Evaluation
3.2 Multiple-Classifier Coordination Algorithm Based on Bayes Theory
3.2.1 Bayes Theory
3.2.2 Multiple-Classifier Coordination Algorithm Based on Bayes Theory
3.3 Multiple-Element Text Categorization Coordination Algorithm Based on SA
3.3.1 SA
3.3.2 Multiple-Element Text Categorization Coordination Algorithm Based on SA
3.4 Summary of This Chapter

Chapter 4 Experimental Results and Analysis
4.1 Corpus Description
4.2 Text Categorization Algorithm
4.2.1 Effect of the Number of Features on the Classifier
4.2.2 Effect of the Feature Selection Algorithm on the Classifier
4.2.3 Effect of the K Value
4.3 Multiple-Element Text Coordination Algorithm Based on KNN
4.3.1 Results of Multi-Element Text Categorization Based on KNN
4.3.2 Comparison of Multi-Element Text Categorization Methods
4.4 Summary of This Chapter

Chapter 5 Conclusions
5.1 Conclusions
5.2 Prospects for Future Work

References
Publications
Acknowledgements

Degree papers are held in the Xiamen University Electronic Theses and Dissertations Database. Full texts are available in the following ways:
1. If your library is a CALIS member library, please log on to http://etd.calis.edu.cn/ and submit a request online, or consult the interlibrary loan department in your library.
2. Users of non-CALIS member libraries can email etd@xmu.edu.cn for delivery details.