Abstract: In the era of big data, both Internet data and offline data are growing exponentially, and this massive data consists mainly of structured or semi-structured text files. How to quickly and effectively find the data users need within such masses of data, and how to improve retrieval accuracy, have therefore become major challenges. Retrieving text data first requires accurate and effective classification of that text, so text classification has become the main difficulty in text data processing. The purpose of this thesis is therefore to study efficient massive-text classification algorithms on existing hardware.

This thesis studies the storage and classification of massive text based on Hadoop. First, it designs and implements a distributed, highly reliable, and highly available data storage module, addressing the difficulty of storing massive text. Second, it proposes a MapReduce-based distributed parallel Chinese word-segmentation algorithm and improves the data-reading mode of MapReduce's InputFormat to address Hadoop's low efficiency on small files; compared with the default MapReduce approach, segmentation efficiency improves by a factor of 52, addressing the current difficulty of segmenting massive text. Finally, the thesis studies a massive web-text classification algorithm on the MapReduce distributed computing framework, builds a Naive Bayes text classification model, and verifies it experimentally; the proposed classifier achieves precision and recall of up to 97% on unknown text.

Thesis keywords: HDFS; Hadoop; MapReduce; text classification; Chinese word segmentation

Research on Massive Text Classification Algorithm Based on Hadoop

Abstract: In the era of big data, both Internet data and offline data are growing exponentially, and this data consists mainly of structured or semi-structured text files. How to effectively and quickly find the data users need within such masses of data, and how to improve search accuracy, have therefore become great challenges. Finding text data first requires accurate and efficient classification of the text, so text classification has become the main difficulty of text data processing. The purpose of this paper is therefore to study efficient massive-text classification algorithms on existing hardware.

This paper studies the storage and classification of massive text based on Hadoop. First, we design and implement a distributed, highly reliable, and highly available data storage module, which solves the problem of storing massive text. Second, we propose a MapReduce-based distributed parallel Chinese word-segmentation algorithm and improve the data-reading mode of MapReduce's InputFormat to address Hadoop's low efficiency on small files; compared with the default MapReduce implementation, segmentation efficiency improves by a factor of 52, addressing the current difficulty of segmenting massive text. Finally, we study a massive web-text classification algorithm on the MapReduce distributed computing framework, build a Naive Bayes classification model, and verify it experimentally; the proposed classifier achieves precision and recall of up to 97% on unknown text.
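The classification model named above is multinomial Naive Bayes. As a minimal single-machine sketch (the thesis parallelizes the counting step with MapReduce; the category names and tokens below are purely illustrative, not from the thesis), the model scores a document by the log prior of each class plus the Laplace-smoothed log likelihood of each token:

```python
from collections import Counter, defaultdict
import math

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with Laplace smoothing.

    In a MapReduce setting, the per-class token counts gathered in
    train() would be produced by mappers/reducers over HDFS files;
    here they are computed locally for illustration.
    """

    def __init__(self):
        self.class_doc_counts = Counter()          # documents per class
        self.token_counts = defaultdict(Counter)   # count(token, class)
        self.vocab = set()
        self.total_docs = 0

    def train(self, documents):
        # documents: iterable of (tokens, label) pairs
        for tokens, label in documents:
            self.class_doc_counts[label] += 1
            self.total_docs += 1
            for t in tokens:
                self.token_counts[label][t] += 1
                self.vocab.add(t)

    def predict(self, tokens):
        best_label, best_score = None, float("-inf")
        v = len(self.vocab)
        for label, n_docs in self.class_doc_counts.items():
            # log prior P(c), then sum of log likelihoods log P(t|c)
            score = math.log(n_docs / self.total_docs)
            total = sum(self.token_counts[label].values())
            for t in tokens:
                # add-one (Laplace) smoothing avoids zero probability
                # for tokens unseen in this class
                score += math.log((self.token_counts[label][t] + 1) / (total + v))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy usage with hypothetical categories and pre-segmented tokens
clf = NaiveBayesTextClassifier()
clf.train([
    (["stock", "market", "trade"], "finance"),
    (["market", "fund", "bank"], "finance"),
    (["game", "team", "score"], "sports"),
    (["team", "match", "goal"], "sports"),
])
print(clf.predict(["market", "bank"]))   # finance
print(clf.predict(["team", "score"]))    # sports
```

For Chinese text, the token lists would come from the word-segmentation step described above, since Chinese has no whitespace word boundaries.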

Keywords: HDFS; Hadoop; MapReduce; Text categorization; Chinese word segmentation

Contents

1 Introduction

1.1 Research background

1.2 Research status at home and abroad

1.2.1 Research status of big data at home and abroad

1.2.2 Research status of text classification

1.3 Main work

1.4 Thesis organization

2 Research on the Big Data Technology Hadoop
