資料結構--堆 Heap

1. 概念

堆是一種特殊的樹

a. 堆是完全二叉樹（除最後一層，其他層都是滿的，最後一層節點都靠左）

b. 每一個節點都大于等于（或者都小于等于）其子樹中每個節點的值

2. 操作和存儲

完全二叉樹适合用數組存儲，節省空間（不需要左右指針）

2.1 插入一個元素

2.2 删除堆頂元素

/**
 * @description: 堆
 * @author: michael ming
 * @date: 2019/5/26 22:22
 * @modified by: 
 */
#include <iostream>
#include <limits.h>
using namespace std;
class MinHeap
{
    int *arr;
    int capacity;
    int heap_size;
public:
    MinHeap(int cap)
    {
        heap_size = 0;
        capacity = cap;
        arr = new int [capacity];
    }
    ~MinHeap()
    {
        delete [] arr;
    }
    int heapsize()
    {
        return heap_size;
    }
    bool full()
    {
        return capacity == heap_size;
    }
    void MinHeapify(int i)
    {
        Heapify(i,heap_size);
    }
    void Heapify(int i, int size)
    {
        int l = left(i), r = right(i);
        int min = i;
        if(l < size && arr[l] < arr[i])
            min = l;
        if(r < size && arr[r] < arr[min])
            min = r;
        if(min != i)
        {
            swap(arr[i], arr[min]);
            Heapify(min,size);
        }
    }
    int parent(int i)
    {
        return (i-1)/2;
    }
    int left(int i)
    {
        return 2*i+1;
    }
    int right(int i)
    {
        return 2*i+2;
    }
    void delMin()
    {
        if(heap_size <= 0)
            return;
        arr[0] = arr[heap_size-1];
        heap_size--;
        MinHeapify(0);
    }
    int getMin()
    {
        return arr[0];
    }
    void insert(int val)
    {
        if(heap_size == capacity)
        {
            cout << "overflow!" << endl;
            return;
        }
        heap_size++;
        int i = heap_size-1;
        arr[i] = val;
        while(i > 0 && arr[parent(i)] > arr[i])
        {
            swap(arr[parent(i)], arr[i]);
            i = parent(i);
        }
    }
    void sort()
    {
        int size = heap_size;
        for(int i = heap_size-1; i >= 0; --i)
        {
            swap(arr[i], arr[0]);
            Heapify(0,--size);
        }
    }
    void print()
    {
        for(int i = 0; i < heap_size; ++i)
            cout << arr[i] << " ";
    }
};
int main()
{
    MinHeap minheap(8);
    minheap.insert(6);
    minheap.insert(8);
    minheap.insert(12);
    minheap.insert(4);
    minheap.insert(15);
    minheap.insert(0);
    minheap.insert(5);
    minheap.insert(9);
    minheap.insert(10);
    minheap.delMin();
    cout << minheap.getMin() << endl;
    return 0;
}

複制

3. 堆排序（不穩定排序）

3.1 建堆

方法1：一個一個的插入這種方式
方法2：從倒數第一個有葉子節點的節點開始，檢查其子節點是否滿足堆，依次往前，直到堆頂，建堆的複雜度O(n)

3.2 排序

建堆結束後，最大元素在堆頂，與最後一個元素交換（不穩定），然後對剩餘的 n-1 個元素重新建構堆，重複這個過程，最後隻剩1個元素，排序結束，建堆複雜度O(nlogn)

堆排序代碼見：https://blog.csdn.net/qq_21201267/article/details/80993672#t7

3.3 思考：為什麼快速排序要比堆排序性能好？兩者都是O(nlogn)

快排資料通路方式比較友好。快排通路資料是順序通路，堆排序是跳着通路，這樣對CPU緩存是不友好的
同樣的資料，堆排序中資料交換次數多餘快排。快排的交換次數不會比逆序度多；堆排序第一步建堆，打亂了資料原有順序，資料有序度降低，交換次數變多

4. 堆應用

4.1 優先級隊列

優先級隊列用堆實作是最直接、最高效的。優先出隊，就是堆頂取出

a. 合并有序小檔案

把多個有序的小檔案的第一個元素取出，放入堆中，取出堆頂到大檔案，然後再從小檔案中取出一個加入到堆，這樣就把小檔案的元素合并到大檔案中了。

/**
 * @description: 合并有序小檔案
 * @author: michael ming
 * @date: 2019/5/29 22:10
 * @modified by: 
 */
#include "heap.cpp"
int main()
{
    int file0[10] = {11, 21, 31, 41, 51, 61, 71, 81, 91, 101};
    int file1[8] = {1, 2, 3, 4, 5, 6, 7, 80};
    int file2[9] = {13, 23, 33, 43, 53, 63, 73, 83, 93};
    int file3[10] = {12, 22, 32, 42, 52, 62, 72, 82, 92, 102};
    int file4[7] = {15, 25, 35, 45, 55, 65, 72};
    int len0 = 10, len1 = 8, len2 = 9, len3 = 10, len4 = 7;
    int i0, i1, i2, i3, i4, j;
    i0 = i1 = i2 = i3 = i4 = j = 0;
    const int new_len = len0+len1+len2+len3+len4;
    int bigFile[new_len];
    MinHeap intheap(5);
    intheap.insert(file0[i0]);
    intheap.insert(file1[i1]);
    intheap.insert(file2[i2]);
    intheap.insert(file3[i3]);
    intheap.insert(file4[i4]);
    int top;
    while(intheap.heapsize())
    {
        top = intheap.getMin();
        bigFile[j++] = top;
        intheap.delMin();
        if(i0 < len0-1 && top == file0[i0])   //可以用list做，就不用查找最小的是哪個檔案的
        {
            intheap.insert(file0[++i0]);
        }
        else if(i1 < len1-1 && top == file1[i1])
        {
            intheap.insert(file1[++i1]);
        }
        else if(i2 < len2-1 && top == file2[i2])
        {
            intheap.insert(file2[++i2]);
        }
        else if(i3 < len3-1 && top == file3[i3])
        {
            intheap.insert(file3[++i3]);
        }
        else if(i4 < len4-1 && top == file4[i4])
        {
            intheap.insert(file4[++i4]);
        }
    }
    for(i0 = 1, j = 0; j < new_len; i0++,++j)
    {
        cout << bigFile[j] << " ";
        if(i0%10 == 0)
            cout << endl;
    }
    return 0;
}

複制

b. 高性能定時器

多個定時器，需要每隔很短的時間（比如1秒）掃描一遍，查詢哪個任務時間到了，需要開始執行，這樣有時候很多掃描是徒勞的，如果任務清單很長，掃描很耗時。采用小頂堆，最先執行的放在堆頂，就隻需要在堆頂時間到時執行堆頂任務即可，避免無謂的掃描。

4.2 用堆求 Top K（前K大資料）

a. 針對靜态資料（資料不變）

建立大小為K的小頂堆，周遊數組，數組元素與堆頂比較，比堆頂大，就把堆頂删除，并插入該元素到堆

b. 針對動态資料（資料不斷插入更新的）

在動态資料插入的時候就與堆頂比較，看是否入堆，始終維護這個堆，需要的時候直接傳回，最壞O(n*lgK)

/**
 * @description: 用堆求最大的k個元素
 * @author: michael ming
 * @date: 2019/5/30 0:06
 * @modified by: 
 */
#include "heap.cpp"
int main()
{
    int i = 0;
    const int len = 10;
    int data[len] = {2, 8, 1, 7, 12, 24, 6, 10, 90, 100};
    MinHeap intheap(5);//求前5大元素
    while(!intheap.full())
    {
        if(i < len)
            intheap.insert(data[i++]);
    }
    while(i < len)
    {
        if(data[i] > intheap.getMin())
        {
            intheap.delMin();
            intheap.insert(data[i]);
        }
        i++;
    }
    intheap.sort();
    intheap.print();
    return 0;
}

複制

4.3 求中位數

靜态資料：先排序，中間位置的數就是中位數

動态資料：動态變化，中位數位置總在變動，每次都排序的話，效率很低，借助堆可以高效的解決。

插入資料到某個堆裡後，兩個堆資料量（應相等或者大堆比小堆多1）若不滿足括号要求，則将某個堆的堆頂元素移動到另一個堆，直到滿足括号要求，堆化複雜度O(lgn)，大堆的堆頂就是中位數，求中位數複雜度O(1)。

同理，可以求99%響應時間（就是大于前面99%資料的那個資料）

跟上面過程類似，隻是大堆中儲存 n x 0.99 個資料，小堆存 n x 0.01 個資料

/**
 * @description: 利用大小兩個堆求中位數
 * @author: michael ming
 * @date: 2019/5/30 20:37
 * @modified by: 
 */
#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;
void printVec(vector<int> v)
{
    for (int i = 0; i < v.size(); ++i)
        cout << v[i] << " ";
    cout << endl;
}
int main()
{
    const int len = 7;
    int staticArr[len] = {7,-1,9,0,8,77,24};//-1,0,7,*8*,9,24,77
    vector<int> maxheap, minheap;
    for(int i = 0; i < len; ++i)
    {
        if(maxheap.empty())
        {
            maxheap.push_back(staticArr[i]);
            continue;
        }
        //----選擇插入哪個堆-----
        if(!maxheap.empty() && staticArr[i] <= maxheap[0])
        {
            maxheap.push_back(staticArr[i]);
            push_heap(maxheap.begin(),maxheap.end());//預設采用 < , 大堆
        }
        else if(!maxheap.empty() && staticArr[i] > maxheap[0])
        {
            minheap.push_back(staticArr[i]);
            push_heap(minheap.begin(),minheap.end(),greater<int>());//小堆，采用 >
        }
        //----平衡兩個堆的節點比列----
        if(maxheap.size() > minheap.size() && maxheap.size() - minheap.size() > 1)
        {
            minheap.push_back(maxheap[0]);//大堆頂進入小堆
            push_heap(minheap.begin(),minheap.end(),greater<int>());
            pop_heap(maxheap.begin(),maxheap.end());//堆頂到末尾了
            maxheap.pop_back();//删除到末尾的"堆頂"
        }
        else if(maxheap.size() < minheap.size())
        {
            maxheap.push_back(minheap[0]);
            push_heap(maxheap.begin(),maxheap.end());//預設采用 < , 大堆
            pop_heap(minheap.begin(),minheap.end(),greater<int>());
            minheap.pop_back();
        }
    }
    if(maxheap.size())
        cout << "中位數為：" << maxheap[0] << endl;
    return 0;
}

複制

stl heap操作：http://www.cplusplus.com/reference/algorithm/pop_heap/

對上面程式稍加改造，對動态資料進行求解中位數

/**
 * @description: 中位數求解，利用大小堆，動态資料插入
 * @author: michael ming
 * @date: 2019/5/30 23:13
 * @modified by: 
 */
#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;
void printVec(vector<int> v)
{
    for (int i = 0; i < v.size(); ++i)
        cout << v[i] << " ";
    cout << endl;
}
int main()
{
    int num;
    vector<int> maxheap, minheap, allnum;
    while(cin >> num)
    {
        allnum.push_back(num);
        if(maxheap.empty())
        {
            maxheap.push_back(num);
            cout << "排序後的數組：" << endl;
            printVec(allnum);
            cout << "中位數為：" << maxheap[0] << endl;
            continue;
        }
        //----選擇插入哪個堆-----
        if(!maxheap.empty() && num <= maxheap[0])
        {
            maxheap.push_back(num);
            push_heap(maxheap.begin(),maxheap.end());//預設采用 < , 大堆
        }
        else if(!maxheap.empty() && num > maxheap[0])
        {
            minheap.push_back(num);
            push_heap(minheap.begin(),minheap.end(),greater<int>());//小堆，采用 >
        }
        //----平衡兩個堆的節點比列----
        if(maxheap.size() > minheap.size() && maxheap.size() - minheap.size() > 1)
        {
            minheap.push_back(maxheap[0]);//大堆頂進入小堆
            push_heap(minheap.begin(),minheap.end(),greater<int>());
            pop_heap(maxheap.begin(),maxheap.end());//堆頂到末尾了
            maxheap.pop_back();//删除到末尾的"堆頂"
        }
        else if(maxheap.size() < minheap.size())
        {
            maxheap.push_back(minheap[0]);
            push_heap(maxheap.begin(),maxheap.end());//預設采用 < , 大堆
            pop_heap(minheap.begin(),minheap.end(),greater<int>());
            minheap.pop_back();
        }
        sort(allnum.begin(),allnum.end());
        cout << "排序後的數組：" << endl;
        printVec(allnum);
        if(maxheap.size())
            cout << "中位數為：" << maxheap[0] << endl;
    }
    return 0;
}

複制

4.4 思考：海量關鍵詞搜尋記錄，求topK

用散清單去重，并累加搜尋次數
再建立大小為K的小頂堆，周遊散清單，次數大于堆頂的，頂替堆頂入堆
（以上在散清單很大時，需要記憶體超過單機記憶體，怎麼辦？）
建立n個空檔案，對搜尋關鍵詞求哈希值，哈希值對n取模，得到該關鍵詞被分到的檔案号（0到n-1）
對每個檔案，利用散列和堆，分别求出topK，然後把n個topK（比如10個Top 20，200很小了吧）放在一起，出現次數最多的K（20）個關鍵詞就是這海量資料裡搜尋最頻繁的。