Article Preview
TopIntroduction
Currently, a large volume of data in various environments is generated. Such a large volume of data consumes valuable resources such as space, network bandwidth, and CPU. In order to save the resources, data compression schemes have been developed and applied in many applications.
However, they do not consider the effect of data reordering. If we reorganize data, the compression ratio for the reorganized data may be improved compared to that for the original data. Some papers deal with data reordering problems in very limited environments (Apaydin, Tosun & Ferhatosmanoglu, 2008; Blandford & Blelloch, 2002; Johnson, Krishnan, & Chhugani, 2004; Pinar, Tao & Ferhatosmanoglu, 2005). The work of Apaydin et al. (2008), Blandford and Blelloch (2002), Johnson et al. (2004), and Pinar et al (2005) assumes that the order of data does not have to be preserved, that is, the input data is unordered data. However, in general, the order of data should be preserved and has the important information. For example, time series data should be ordered by the time. If we change the order of the time series data, it will lose much information. Therefore, the approaches in Apaydin et al. (2008), Blandford and Blelloch (2002), Johnson et al. (2004), and Pinar et al. (2005) cannot be applied to the ordered data such as time series data (Chen, Dong, Han, Wah, & Wan, 2002; Korn, Jagadish, & Faloutsos, 1997; Reeves, Liu, Nath, & Zhao, 2009; Elmeleegy, Elmagarmid, Cecchet, Aref, & Zwaenepoel, 2009).
Consider the run-length encoding with the ordered data D = {1, 1, 1, 3, 3, 1, 1, 4, 4, 4}. The run-length encoding is one of the widely used lossless compression schemes. It replaces repeated values with <value, count>, where count is the number of repeated values. We can represent D by the run-length encoding as follows. Note that RLE(D) is the compressed data for D using the run-length encoding. See Table 1 for the detailed notational convention.
RLE(D) = {<1,3>,<3,2>,<1,2>,4,3>},where in the pair <a,b>, a is value and b is count.
Table 1. Notation | Meaning |
RLE(D) | Compressed data for D using the run-length encoding |
BUCKET(D, ε) | Compressed data for D using the bucketing scheme with an error bound ε |
di | the i-th element in data set D |
di,j | {di, di+1, · · ·, dj−1, dj}, where i≤ j |
<X> | token X in the run-length encoding or the bucketing scheme |
≪X≫ | movement information X |