Abstract: In the era of big data, both Internet data and offline data are growing exponentially, and this massive data consists mainly of structured or semi-structured text files. How to quickly and effectively find the data users need within such masses of data, and how to improve retrieval accuracy, have therefore become major challenges. Retrieving text data first requires accurate and effective classification of that text, so text classification has become the main difficulty in text data processing. The purpose of this thesis is therefore to study efficient massive-text classification algorithms on existing hardware.

This thesis studies the storage and classification of massive text based on Hadoop. First, it designs and implements a distributed, highly reliable, and highly available data storage module, addressing the difficulty of storing massive text. Second, it proposes a MapReduce-based distributed parallel Chinese word-segmentation algorithm and improves the data-reading mode of MapReduce's InputFormat to address Hadoop's low efficiency on small files; compared with the default MapReduce approach, segmentation efficiency improves by a factor of 52, addressing the current difficulty of segmenting massive text. Finally, the thesis studies a massive web-text classification algorithm on the MapReduce distributed computing framework, builds a Naive Bayes text classification model, and verifies it experimentally; the proposed classifier achieves precision and recall of up to 97% on unknown text.

Thesis keywords: HDFS; Hadoop; MapReduce; text classification; Chinese word segmentation

Research on Massive Text Classification Algorithm Based on Hadoop

Abstract: In the era of big data, both Internet data and offline data are growing exponentially, and this data consists mainly of structured or semi-structured text files. How to effectively and quickly find the data users need within such masses of data, and how to improve search accuracy, have therefore become great challenges. Finding text data first requires accurate and efficient classification of the text, so text classification has become the main difficulty of text data processing. The purpose of this paper is therefore to study efficient massive-text classification algorithms on existing hardware.

This paper studies the storage and classification of massive text based on Hadoop. First, we design and implement a distributed, highly reliable, and highly available data storage module, which solves the problem of storing massive text. Second, we propose a MapReduce-based distributed parallel Chinese word-segmentation algorithm and improve the data-reading mode of MapReduce's InputFormat to address Hadoop's low efficiency on small files; compared with the default MapReduce implementation, segmentation efficiency improves by a factor of 52, addressing the current difficulty of segmenting massive text. Finally, we study a massive web-text classification algorithm on the MapReduce distributed computing framework, build a Naive Bayes classification model, and verify it experimentally; the proposed classifier achieves precision and recall of up to 97% on unknown text.
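The classification model named above is multinomial Naive Bayes. As a minimal single-machine sketch (the thesis parallelizes the counting step with MapReduce; the category names and tokens below are purely illustrative, not from the thesis), the model scores a document by the log prior of each class plus the Laplace-smoothed log likelihood of each token:

```python
from collections import Counter, defaultdict
import math

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with Laplace smoothing.

    In a MapReduce setting, the per-class token counts gathered in
    train() would be produced by mappers/reducers over HDFS files;
    here they are computed locally for illustration.
    """

    def __init__(self):
        self.class_doc_counts = Counter()          # documents per class
        self.token_counts = defaultdict(Counter)   # count(token, class)
        self.vocab = set()
        self.total_docs = 0

    def train(self, documents):
        # documents: iterable of (tokens, label) pairs
        for tokens, label in documents:
            self.class_doc_counts[label] += 1
            self.total_docs += 1
            for t in tokens:
                self.token_counts[label][t] += 1
                self.vocab.add(t)

    def predict(self, tokens):
        best_label, best_score = None, float("-inf")
        v = len(self.vocab)
        for label, n_docs in self.class_doc_counts.items():
            # log prior P(c), then sum of log likelihoods log P(t|c)
            score = math.log(n_docs / self.total_docs)
            total = sum(self.token_counts[label].values())
            for t in tokens:
                # add-one (Laplace) smoothing avoids zero probability
                # for tokens unseen in this class
                score += math.log((self.token_counts[label][t] + 1) / (total + v))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy usage with hypothetical categories and pre-segmented tokens
clf = NaiveBayesTextClassifier()
clf.train([
    (["stock", "market", "trade"], "finance"),
    (["market", "fund", "bank"], "finance"),
    (["game", "team", "score"], "sports"),
    (["team", "match", "goal"], "sports"),
])
print(clf.predict(["market", "bank"]))   # finance
print(clf.predict(["team", "score"]))    # sports
```

For Chinese text, the token lists would come from the word-segmentation step described above, since Chinese has no whitespace word boundaries.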

Keywords: HDFS; Hadoop; MapReduce; Text categorization; Chinese word segmentation

Contents

1 Introduction

1.1 Research background

1.2 Research status at home and abroad

1.2.1 Research status of big data at home and abroad

1.2.2 Research status of text classification

1.3 Main work

1.4 Thesis organization

2 Research on the Big Data Technology Hadoop
