r3novatio's blog

By r3novatio, history, 4 years ago, In English

I read somewhere that the C++ STL partial_sort function (used to find the k smallest numbers) has a running time complexity of O(n log k).

It achieves this by going through all the data, maintaining a k-sized max-heap, throwing away the top whenever the size exceeds k, and in the end printing the k elements remaining in the heap.
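
In code, the idea is roughly this (just a sketch of the approach as I understand it, not the actual STL implementation):

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Keep the k smallest elements seen so far in a max-heap of size at most k.
// Each of the n elements costs O(log k) heap work, so the total is O(n log k).
std::vector<int> k_smallest(const std::vector<int>& a, std::size_t k) {
    std::priority_queue<int> heap;            // max-heap
    for (int x : a) {
        heap.push(x);
        if (heap.size() > k) heap.pop();      // throw away the current largest
    }
    std::vector<int> res;
    while (!heap.empty()) { res.push_back(heap.top()); heap.pop(); }
    return res;                               // the k smallest, largest first
}
```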

Couldn't an n-sized min-heap be used instead, and the k smallest numbers simply extracted from it? I understand it wouldn't be in-place and would require additional memory.

But complexity-wise, it would take O(n + k log n), which is better than O(n log k), right? (assuming k is some number smaller than n)
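
For comparison, here is a sketch of what I mean (using std::make_heap for the O(n) heapify and std::pop_heap for the extractions):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Build a min-heap over a copy of all n elements in O(n), then extract the
// k smallest in O(k log n), for O(n + k log n) total and O(n) extra memory.
std::vector<int> k_smallest_via_full_heap(std::vector<int> a, std::size_t k) {
    std::make_heap(a.begin(), a.end(), std::greater<int>());     // O(n) heapify
    std::vector<int> res;
    for (std::size_t i = 0; i < k && !a.empty(); ++i) {
        std::pop_heap(a.begin(), a.end(), std::greater<int>());  // min moves to the back
        res.push_back(a.back());
        a.pop_back();
    }
    return res;   // the k smallest, in ascending order
}
```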

Then why is the O(n log k) version preferred? Why is it mentioned everywhere and used by the standard library?

  Vote: I like it +21 Vote: I do not like it

»
4 years ago, # |
  Vote: I like it 0 Vote: I do not like it

That would require using a heap which has $$$O(1)$$$ insertion, and those usually have a really bad constant.

  • »
    »
    4 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    How is the O(n log k) achieved then, if not by using heaps/PQs?

    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      For the $$$O(n \log k)$$$ algorithm you can use a heap with both $$$O(\log \text{size})$$$ insertion and $$$O(\log \text{size})$$$ deletion of the first element, for which an ordinary binary heap works, and that is pretty fast in practice. For the $$$O(n + k \log k)$$$ algorithm you would need a heap with $$$O(1)$$$ insertion and $$$O(\log \text{size})$$$ deletion of the first element, and such heaps usually have a much higher constant factor.

      • »
        »
        »
        »
        4 years ago, # ^ |
          Vote: I like it +15 Vote: I do not like it

        You do not need to insert all elements one by one. That would take O(n log n) just to build the heap.

        If I remember my Data Structures class correctly, one can simply use heapify to build a heap of n numbers in O(n) time. Doesn't necessarily need O(1) insertion for that.

        • »
          »
          »
          »
          »
          4 years ago, # ^ |
            Vote: I like it +1 Vote: I do not like it

          Yeah, that's actually true. In that case I don't know the answer.

  • »
    »
    4 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    A heap can be built in O(n) using the function make_heap(). It is a practical algorithm: it builds the heap from the bottom up. In this problem, we don't need to insert the elements n times.

»
4 years ago, # |
  Vote: I like it +16 Vote: I do not like it

It can be implemented in-place in $$$O(n + k \log k)$$$ time using nth_element followed by sort. So any larger bound seems strange to me.
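
Something along these lines (my own sketch of the idea):

```cpp
#include <algorithm>
#include <vector>

// Move the k smallest elements to a[0..k) with nth_element (O(n) on average),
// then sort just that prefix in O(k log k), for O(n + k log k) overall, in-place.
void smallest_k_sorted(std::vector<int>& a, std::size_t k) {
    if (k >= a.size()) { std::sort(a.begin(), a.end()); return; }
    std::nth_element(a.begin(), a.begin() + k, a.end());
    std::sort(a.begin(), a.begin() + k);
}
```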

  • »
    »
    4 years ago, # ^ |
      Vote: I like it +5 Vote: I do not like it

    nth_element guarantees only average-case complexity iirc

    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Doesn't it also use median of medians, which guarantees worst-case complexity?

    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      We can use median of medians as the pivot-selection method and it ensures linear complexity, while afaik nth_element uses quickselect, which can run into bad partitions and only ensures linear complexity on average.

  • »
    »
    4 years ago, # ^ |
      Vote: I like it +10 Vote: I do not like it

    I just realised another "theoretical" way to achieve this $$$O(n+k\log k)$$$ bound. We could heapify the array in $$$O(n)$$$ time. Then construct another heap (starting from empty) and insert $$$(value, index)$$$ pairs from the old array into the new one. Every time, pop one pair and push two (its children in the original array), so the new heap never grows beyond $$$O(k)$$$ elements. Of course coding this would be a mess, but it would guarantee the theoretical upper bound of $$$O(n+k\log k)$$$.
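
    A rough sketch of what I mean (assuming the usual array layout of a binary heap, where the children of index $$$i$$$ sit at positions $$$2i+1$$$ and $$$2i+2$$$):

    ```cpp
    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // O(n + k log k): heapify the whole array in O(n), then explore it lazily with
    // a second min-heap of (value, index) candidates, pushing a node's two children
    // whenever it is popped. After k pops the candidate heap holds O(k) entries.
    std::vector<int> k_smallest_lazy(std::vector<int> a, std::size_t k) {
        std::make_heap(a.begin(), a.end(), std::greater<int>());    // min-heap, O(n)
        using Node = std::pair<int, std::size_t>;                   // (value, index)
        std::priority_queue<Node, std::vector<Node>, std::greater<Node>> cand;
        std::vector<int> res;
        if (!a.empty() && k > 0) cand.push({a[0], 0});
        while (res.size() < k && !cand.empty()) {
            auto [v, i] = cand.top();
            cand.pop();
            res.push_back(v);
            for (std::size_t c : {2 * i + 1, 2 * i + 2})            // children in the array heap
                if (c < a.size()) cand.push({a[c], c});
        }
        return res;   // the k smallest, in ascending order
    }
    ```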

»
4 years ago, # |
  Vote: I like it +12 Vote: I do not like it

An interesting question.

There is a presentation from CppCon 2018 by Fred Tingaud which addresses the matter.

Slides: https://github.com/CppCon/CppCon2018/blob/master/Presentations/a_little_order_delving_into_the_stl_sorting_algorithms/a_little_order_delving_into_the_stl_sorting_algorithms__fred_tingaud__cppcon_2018.pdf

Video, presumably: https://www.youtube.com/watch?v=-0tO3Eni2uo.

Here's a short summary from slide 87:

The typical usage of std::partial_sort is to sort a small subset of elements in a big container.
The STL implementers chose a faster $$$O(N \cdot \log(k))$$$ algorithm that performs well for this typical use-case at the expense of other scenarios.