Dong, Dapeng (2018) Content-aware Partial Compression for Textual Big Data Analysis in Hadoop.
IEEE TRANSACTIONS ON BIG DATA, 4 (4). pp. 459-472. ISSN 2332-7790
Abstract
A substantial amount of information in companies and on the Internet is present in the form of text. The value of this
semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The
ever-increasing data production, however, pushes data analytic platforms to their limit. Compression as an effective means to reduce
data size has been employed by many emerging data analytic platforms, where the main purpose of data compression is to save
storage space and reduce data transmission cost over the network. Since general purpose compression methods endeavour to
achieve higher compression ratios by leveraging data transformation techniques and contextual data, this context-dependency forces
the access to the compressed data to be sequential. Processing such compressed data in parallel, as is desirable in a distributed
environment, is extremely challenging. This work proposes techniques for more efficient textual big data analysis with an emphasis on
content-aware compression schemes suitable for the Hadoop analytic platform. The compression schemes have been evaluated for a
number of standard MapReduce analysis tasks using a collection of public and private real-world datasets. In comparison with existing
solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.
Item Type: Article
Additional Information: This is the preprint version of the published article, which is available as D. Dong and J. Herbert, "Content-Aware Partial Compression for Textual Big Data Analysis in Hadoop," in IEEE Transactions on Big Data, vol. 4, no. 4, pp. 459-472, 1 Dec. 2018, doi: 10.1109/TBDATA.2017.2721431.
Keywords: Big Data; Compression; MapReduce; Distributed File System
Academic Unit: Faculty of Science and Engineering > Computer Science
Item ID: 13168
Identification Number: https://doi.org/10.1109/TBDATA.2017.2721431
Depositing User: Dapeng Dong
Date Deposited: 05 Aug 2020 15:46
Journal or Publication Title: IEEE TRANSACTIONS ON BIG DATA
Publisher: IEEE
Refereed: Yes
Use Licence: This item is available under a Creative Commons Attribution Non Commercial Share Alike Licence (CC BY-NC-SA).