External Sorting with On-the-Fly Compression

  • John Yiannis
  • Justin Zobel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2712)


Evaluating a query can involve manipulation of large volumes of temporary data. When the volume of data becomes too great, activities such as joins and sorting must use disk, and cost minimisation involves complex trade-offs. In this paper, we explore the effect of compression on the cost of external sorting. Reduction in the volume of data potentially allows costs to be reduced — through reductions in disk traffic and numbers of temporary files — but on-the-fly compression can be slow and many compression methods do not allow random access to individual records. We investigate a range of compression techniques for this problem, and develop successful methods based on common letter sequences. Our experiments show that, for a given memory limit, the overheads of compression outweigh the benefits for smaller data volumes, but for large files compression can yield substantial gains, of one-third of costs in the best case tested. Even when the data is stored uncompressed, our results show that incorporation of compression can significantly accelerate query processing.


Query Processing Compression Technique Compression Scheme Query Evaluation Compression Model 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Zobel, J., Williams, H.E., Kimberley, S.: Trends in retrieval system performance. In Edwards, J., ed.: Proceedings of the Australasian Computer Science Conference, Canberra, Australia (2000) 241–248Google Scholar
  2. 2.
    Boncz, P.A., Manegold, S., Kersten, M.L.: Database architecture optimized for the new bottleneck: Memory access. In: The VLDB Journal. (1999) 54–65Google Scholar
  3. 3.
    Graefe, G.: Query evaluation techniques for large databases. ACM Computing Surveys 25 (1993) 152–153CrossRefGoogle Scholar
  4. 4.
    Chen, Z., Gehrke, J., Korn, F.: Query optimization in compressed database systems. In: Proceedings of ACM SIGMOD international conference on Management of Data, Santa Barbara, California, USA (2001) 271–282Google Scholar
  5. 5.
    Goldstein, J., Ramakrishnan, R., Shaft, U.: Compressing relations and indexes. In: Proceedings of the Fourteenth International Conference on Data Engineering, Orlando, Florida, USA, IEEE Computer Society (1998) 370–379Google Scholar
  6. 6.
    Graefe, G., Shapiro, L.: Data compression and database performance. In ACM/IEEE-CS Symposium On Applied Computing (1991) 22–27Google Scholar
  7. 7.
    Ng, W.K., Ravishankar, C.V.: Relational database compression using augmented vector quantization. In: Proceedings of the Eleventh International Conference on Data Engineering, Taipei, Taiwan, IEEE Computer Society (1995) 540–549CrossRefGoogle Scholar
  8. 8.
    Ray, G., Harista, J.R., Seshadri, S.: Database compression: A performance enhancement tool. In: Proceedings of the 7th International Conference on Management of Data (COMAD), Pune, India (1995)Google Scholar
  9. 9.
    Westman, T., Kossmann, D., Helmer, S., Moerkotte, G.: The implementation and performance of compressed databases. ACM SIGMOD Record 29 (2000)Google Scholar
  10. 10.
    Moffat, A., Zobel, J., Sharman, N.: Text compression for dynamic document databases. IEEE Transactions on Knowledge and Data Engineering 9 (1997) 302–313CrossRefGoogle Scholar
  11. 11.
    Larmore, L.L., Hirschberg, D.S.: A fast algorithm for optimal length-limited Huff-man codes. Journal of the ACM 37 (1990) 464–473zbMATHCrossRefMathSciNetGoogle Scholar
  12. 12.
    Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Second edn. Morgan Kaufmann, San Francisco, California (1999)Google Scholar
  13. 13.
    Bell, T.C., Moffat, A., Nevill-Manning, C.G., Witten, I.H., Zobel., J.: Data compression in full-text retrieval systems. Journal of the American Society for Information Science 44 (1993) 508–531CrossRefGoogle Scholar
  14. 14.
    Scholer, F., Williams, H.E., Yiannis, J., Zobel, J.: Compression of inverted indexes for fast query evaluation. In: Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval. (2002) 222–229Google Scholar
  15. 15.
    Williams, H.E., Zobel, J.: Compressing integers for fast file access. Computer Journal 42 (1999) 193–201CrossRefGoogle Scholar
  16. 16.
    Zobel, J., Moffat, A.: Adding compression to a full-text retrieval system. Software Practice and Experience 25 (1995) 891–903CrossRefGoogle Scholar
  17. 17.
    Roth, M., Horn, S.V.: Database compression. ACM SIGMOD Record 22 (1993) 31–39CrossRefGoogle Scholar
  18. 18.
    Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems Implementation. First edn. Prentice Hall (2000)Google Scholar
  19. 19.
    Ramakrishnan, R., Gehrke, J.: Database Management Systems. Second edn. McGraw-Hill (2000)Google Scholar
  20. 20.
    Knuth, D.E.: The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, Massachusetts (1973)Google Scholar
  21. 21.
    Cannane, A., Williams, H.: A general-purpose compression scheme for large collections. ACM Transactions on Information Systems 20 (2002) 329–355CrossRefGoogle Scholar
  22. 22.
    Moffat, A., Turpin, A.: Compression and Coding Algorithms. First edn. Kluwer (2002)Google Scholar
  23. 23.
    Ramakrishna, M.V., Zobel, J.: Performance in practice of string hashing functions. In: Proceedings of the Databases Systems for Advanced Applications Symposium, Melbourne, Australia (1997) 215–223Google Scholar
  24. 24.
    Sinha, R., Zobel, J.: Efficient trie-based sorting of large sets of strings. In Oudshoorn, M., ed.: Proceedings of the Australasian Computer Science Conference, Adelaide, Australia (2003) 11–18Google Scholar
  25. 25.
    Sinha, R., Zobel, J.: Cache-conscious sorting of large sets of strings with dynamic tries. In Ladner, R., ed.: Proceedings of the ALENEX Workshop on Algorithm Engineering and Experiments, Baltimore, Maryland (2003)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • John Yiannis
    • 1
  • Justin Zobel
    • 1
  1. 1.School of Computer Science and Information TechnologyRMIT UniversityMelbourneAustralia

Personalised recommendations