
Twenty-Three Soul-Searching HashMap Questions: A Must-Have for the Interview Comeback

Author: Moving Mountain Road Apes

HashMap is a collection we all know well, and it is practically a mandatory interview topic. Starting from simple usage, an interviewer can move on to its principles and data structures, and then extend into concurrency. It is fair to say that a single HashMap can sustain half an hour of conversation.

1. Can you tell us about the data structure of HashMap?

The data structure in JDK 1.7 is array + linked list. Is anyone still using JDK 1.7? Surely not......

Let's talk about the data structure of JDK1.8:

The data structure of JDK1.8 is array + linked list + red-black tree.

The data structure diagram is as follows:

[Figure: HashMap's array + linked list + red-black tree structure]

Among them, the bucket array stores the data elements, the linked list resolves hash conflicts, and the red-black tree improves query efficiency.

  • Data elements are mapped to an index of the bucket array through a mapping relationship, that is, a hash function
  • When a conflict occurs, a linked list is pulled from the conflicting slot and the conflicting element is placed in it
  • If the linked list length > 8 and the array size >= 64, the linked list is converted to a red-black tree
  • If the number of red-black tree nodes drops below 6, it is converted back to a linked list
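For reference, each slot of the bucket array holds a Node. Slightly simplified from the JDK 1.8 source (trimmed, but faithful to the field layout), it looks like this:

import java.util.Map;

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;     // cached, perturbed hash of the key
    final K key;
    V value;
    Node<K,V> next;     // next node in this bucket's chain

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
    public final K getKey()      { return key; }
    public final V getValue()    { return value; }
    public final V setValue(V v) { V old = value; value = v; return old; }
}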

2. What do you know about red-black trees? Why not use a binary search tree or a balanced (AVL) tree?

A red-black tree is essentially a binary search tree that, in order to keep itself balanced, adds some rules on top:

  1. Each node is either red or black;
  2. The root node is always black;
  3. All leaf nodes are black (the leaf nodes here are the NIL/NULL sentinel nodes);
  4. The two child nodes of every red node must be black;
  5. The path from any node to each leaf node in its subtree contains the same number of black nodes.
The reason for not using a plain binary search tree:

The red-black tree is a self-balancing binary search tree whose worst-case time complexity for insertion, deletion, and lookup is O(logn). This avoids the worst-case O(n) of a plain binary search tree, which degrades into a linked list when keys arrive in sorted order.

The reason for not using a balanced (AVL) tree:

A balanced binary tree (AVL tree) is more strictly balanced than a red-black tree, so it needs more rotations to stay balanced. That makes maintaining balance more expensive, and its insertion and deletion are therefore slower than the red-black tree's.

3. Do you know how red-black trees maintain balance?

Red-black trees maintain balance in two ways: rotation and recoloring.

  • Rotation: there are two kinds, left rotation and right rotation
  • Recoloring: flipping a node's color between red and black to restore the red-black properties after an insertion or deletion
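For intuition, here is a sketch of a left rotation in code. The node class and method are illustrative, not the JDK's actual TreeNode implementation:

// Left rotation around x (assumes x.right != null):
// x's right child y is lifted into x's place; subtree b changes parents.
//
//     x                 y
//    / \               / \
//   a   y     ==>     x   c
//      / \           / \
//     b   c         a   b
static final class RBNode<K> {
    K key;
    boolean red;
    RBNode<K> parent, left, right;
}

static <K> RBNode<K> rotateLeft(RBNode<K> root, RBNode<K> x) {
    RBNode<K> y = x.right;
    x.right = y.left;                     // b becomes x's right child
    if (y.left != null) y.left.parent = x;
    y.parent = x.parent;                  // y replaces x under x's parent
    if (x.parent == null) root = y;       // x was the root
    else if (x == x.parent.left) x.parent.left = y;
    else x.parent.right = y;
    y.left = x;                           // x becomes y's left child
    x.parent = y;
    return root;
}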

4. Do you know HashMap's put process?

Let's start with the flowchart:

[Figure: put flowchart]
  1. First, perturb the hash value to obtain a new hash: (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16)
  2. If the table is null or its length is 0, resize (initialize) it first: if ((tab = table) == null || (n = tab.length) == 0) n = (tab = resize()).length;
  3. Compute the index from the hash: tab[i = (n - 1) & hash]. If the slot holds no data, insert directly; otherwise the existing entry must be compared, and an equal key is overwritten.
  4. If tab[i] is a tree node, insert the node into the red-black tree; otherwise insert it into the linked list.
  5. When inserting into the linked list, if the list length reaches 8, convert it to a red-black tree: treeifyBin(tab, hash);
  6. Finally, after the element is in place, check whether size exceeds the threshold; if it does, resize.
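Stringing the steps together, here is a compressed sketch of the putVal flow. This is a simplification, not the JDK source: Node is the bucket node shown earlier, tree insertion is reduced to a comment, and table, size, threshold, hash() and resize() are assumed to be the fields and helpers of the surrounding HashMap-like class:

final V putValSketch(int hash, K key, V value) {
    Node<K,V>[] tab = table; int n;
    if (tab == null || (n = tab.length) == 0)
        n = (tab = resize()).length;                      // step 2: lazy init
    int i = (n - 1) & hash;                               // step 3: compute the index
    if (tab[i] == null) {
        tab[i] = new Node<>(hash, key, value, null);      // empty slot: insert directly
    } else {
        Node<K,V> e = tab[i], tail = null;
        int binCount = 0;
        while (e != null) {                               // walk the chain
            if (e.hash == hash && java.util.Objects.equals(e.key, key)) {
                V old = e.value;
                e.value = value;                          // same key: overwrite
                return old;
            }
            tail = e; e = e.next; binCount++;
        }
        tail.next = new Node<>(hash, key, value, null);   // step 4: append at the tail
        // step 5: if (binCount >= 7) treeifyBin(tab, hash);  // list reached 8
    }
    if (++size > threshold) resize();                     // step 6: grow when over threshold
    return null;
}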

5. How does HashMap find elements?

Let's look at the flowchart first:

[Figure: get flowchart]

HashMap lookup is much simpler:

  1. Use the perturbation function to get a new hash value
  2. Compute the array index and fetch the node there
  3. If that node matches the key, return it directly
  4. Otherwise, if it is a tree node, search the red-black tree
  5. Otherwise, traverse the linked list
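In code, a compressed sketch of the lookup, under the same assumptions as the put sketch above (the tree branch is reduced to a comment):

final Node<K,V> getNodeSketch(Object key) {
    Node<K,V>[] tab = table;
    if (tab == null || tab.length == 0) return null;
    int h = hash(key);                                // step 1: perturbed hash
    Node<K,V> first = tab[(tab.length - 1) & h];      // step 2: index into the table
    for (Node<K,V> e = first; e != null; e = e.next) {
        // a TreeNode would delegate to a red-black tree search here
        if (e.hash == h && java.util.Objects.equals(e.key, key))
            return e;                                 // step 3: key matched
    }
    return null;                                      // not found
}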

6. How is the hash/perturbation function of HashMap designed?

HashMap's hash function first gets the key's hashCode, a 32-bit int, and then XORs the high 16 bits of that hashCode with the low 16 bits.

static final int hash(Object key) {
        int h;
        // XOR the key's hashCode with itself shifted right by 16 bits
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

This is designed to reduce the probability of hash collisions.

7. Why does the hash/perturbation function reduce hash collisions?

Because key.hashCode() calls the hash function built into the key's own type, returning an int hash value. The int range is -2147483648 to 2147483647, roughly 4 billion possible values.

As long as the hash function maps values evenly and loosely, collisions are hard to produce in ordinary applications. The problem is that an array 4 billion slots long cannot fit in memory.

The default initial size of the HashMap array is only 16, so the hash value has to be reduced modulo the array length first; the remainder is then used as the array index.

In the source code, the modulo is implemented as a bitwise AND of the hash value with (array length - 1), since the bit operation is faster than the % remainder operation.

bucketIndex = indexFor(hash, table.length);

static int indexFor(int h, int length) {
     return h & (length-1);
}

By the way, this also explains why the array length of a HashMap must be an integer power of 2: (array length - 1) then acts exactly as a "low-bit mask". The AND operation zeroes out the high bits of the hash value and keeps only the low bits for array indexing. Take the initial length 16: 16 - 1 = 15, which in binary is 0000 0000 0000 0000 0000 0000 0000 1111. ANDing a hash value with it simply keeps the lowest four bits.


This is faster, but a new problem appears: even if the hash values are loosely distributed overall, taking only the last few bits makes collisions severe. Worse, if the hash function itself is flawed and the values fall into, say, an arithmetic progression whose low bits repeat in a regular pattern, things get even more painful.

At this time, the value of the perturbation function is reflected, take a look at the schematic diagram of the perturbation function:

[Figure: schematic of the perturbation function]

Shifting right by 16 bits is exactly half of 32 bits, and XORing the value with its own shifted high half mixes the high and low parts of the original hash code to increase the randomness of the low bits. Moreover, the mixed low bits are doped with features of the high bits, so the high-bit information is also preserved in disguise.
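A tiny runnable demo makes the mixing visible; the key string here is arbitrary:

public class PerturbDemo {
    public static void main(String[] args) {
        int h = "interview".hashCode();
        int perturbed = h ^ (h >>> 16);       // the JDK 8 perturbation
        System.out.println(Integer.toBinaryString(h));
        System.out.println(Integer.toBinaryString(perturbed));
        // with the default 16 slots, only the low 4 bits pick the bucket:
        System.out.println((16 - 1) & h);          // index without perturbation
        System.out.println((16 - 1) & perturbed);  // index after mixing in high bits
    }
}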

8. Why is the capacity of HashMap a power of 2?

  • The first reason is to make computing the index from the hash convenient:

Placing an element in the table array means locating a position from its hash value, conceptually hash % capacity. HashMap instead uses hash & (capacity - 1), which achieves the same effect. This works precisely because HashMap's capacity is a power of 2: a power of 2 has exactly one 1 bit in binary, so subtracting 1 turns that bit into 0 and all the bits below it into 1, making the & operation equivalent to %, and the bit operation is much more efficient than %.

When the capacity is 2 to the nth power, (capacity - 1) in binary has the form 111...111, so the bitwise AND with the hash of an added element uses every low bit, elements are spread evenly across the HashMap's positions, and hash collisions are reduced.

  • The second reason is that the resized capacity is also a power of 2, which lets already-hashed elements be transferred perfectly to the new table

We can briefly look at HashMap's resizing mechanism: HashMap resizes when the number of elements exceeds load factor × capacity.


9. If you initialize a HashMap with new HashMap<>(17), what will it do?

To put it simply, when the value passed at initialization is not a power of 2, HashMap rounds up to the nearest power of 2; pass in 17 and the actual capacity of the HashMap is 32.

Let's look at the details. HashMap's initialization includes this constructor:

public HashMap(int initialCapacity, float loadFactor) {
 ...
 this.loadFactor = loadFactor;
 this.threshold = tableSizeFor(initialCapacity);
}
  • The threshold, calculated by the tableSizeFor method, is based on the initial capacity passed in.
  • This method finds the smallest power of 2 that is greater than or equal to the initial value; pass in 17 and it finds 32.
static final int tableSizeFor(int cap) {
 int n = cap - 1;
 n |= n >>> 1;
 n |= n >>> 2;
 n |= n >>> 4;
 n |= n >>> 8;
 n |= n >>> 16;
 return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}
  • MAXIMUM_CAPACITY = 1 << 30 is the upper limit, the largest capacity a map can have.
  • The calculation shifts the value right by 1, 2, 4, 8, and 16 bits, ORing each result back into the original number. This fills every bit below the highest 1 with 1; a binary number of all 1s is exactly a power of 2 minus 1, so 1 is added to the result before returning.

Take 17 as an example and walk through how the table capacity is computed at initialization:

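Reconstructing that trace in code, with the binary value in the comments:

int n = 17 - 1;       // 16 -> 1 0000
n |= n >>> 1;         // 24 -> 1 1000
n |= n >>> 2;         // 30 -> 1 1110
n |= n >>> 4;         // 31 -> 1 1111
n |= n >>> 8;         // 31 -> 1 1111 (no change)
n |= n >>> 16;        // 31 -> 1 1111 (no change)
int capacity = n + 1; // 32 -> 10 0000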

10. What other hash function construction methods do you know?

The hash construction method used in HashMap is called:

  • Division (modulo) method: H(key) = key % p (p <= N). Divide the key by a positive integer p no larger than the hash table length and use the remainder as the address. Of course, HashMap optimizes and transforms this, making it more efficient and the hashing more balanced.

In addition, there are several common hash function constructors:

  • Direct addressing method: map the key directly to an array position; for example, key 1232 is placed at index 1232
  • Numerical analysis method: take some digits of the key, such as the tens and hundreds digits, as the mapped position
  • Middle-square method: take the middle digits of the key squared as the mapped position
  • Folding method: split the key into segments with the same number of digits, then add the segments together as the mapped position

11. What are the ways to resolve hash collisions?

We know by now that HashMap uses a linked list to handle hash collisions; this method is called:

  • Chaining (separate chaining): pull a linked list at the conflict position and put the conflicting elements into it.

In addition, there are some common ways to resolve conflicts:

  • Open addressing: starting from the conflict position, search onward for a vacant slot for the conflicting element.
  • There are several ways to probe for a free slot:
    • Linear probing: starting from the conflict position, check the next slot, and the next, until a free one is found
    • Quadratic probing: starting from the conflict position x, probe x + 1^2 the first time, x + 2^2 the second time, ... until a free slot is found
    • ……
  • Rehashing: Recalculate the address of the conflicting element by using a different hash function.
  • Create a common overflow area: create another array and put the conflicting elements in.

12. Why is the threshold for converting HashMap's linked list to a red-black tree 8?

Treeification happens when the table array length is at least 64 and the linked list length is greater than 8.

Why 8? The source code comments also give the answer.

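Abridged, the JDK 8 source comment says that under random hashCodes the node counts in buckets follow a Poisson distribution with a parameter of about 0.5 at the default load factor of 0.75, and lists the expected frequencies of bucket sizes:

0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
more: less than 1 in ten million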

A red-black tree node is about twice the size of an ordinary node, so converting to a red-black tree sacrifices space for time; it is more of a fallback strategy to guarantee lookup efficiency in extreme cases.

Why was 8 chosen as the threshold? It is about statistics. Ideally, with random hash codes, the number of nodes in a bucket follows a Poisson distribution; the probability falls fast as the count grows, and the probability of a bucket holding 8 nodes is only 0.00000006.

As for why the threshold for turning a red-black tree back into a linked list is 6 rather than 8: if both thresholds were 8, a bucket whose node count fluctuates right around 8 would keep converting between linked list and red-black tree, wasting resources.

13. When is the capacity expanded? Why is the load factor 0.75?

To reduce the probability of hash collisions, a resize is triggered when the number of elements in the HashMap reaches a critical value; all elements are then rehashed into the enlarged container, which is a time-consuming operation.


This threshold is determined by the load factor and the capacity size of the current container, if the default construction method is used:

threshold = default capacity (DEFAULT_INITIAL_CAPACITY) × default load factor (DEFAULT_LOAD_FACTOR)

That is, once the element count exceeds 16 × 0.75 = 12, a resize is triggered.

So why was 0.75 chosen as HashMap's default load factor?

In simple terms, this is a consideration of the balance between space cost and time cost.

There is this comment in HashMap:

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). (abridged)

We all know that HashMap maps a key by hashing and taking the remainder, and the load factor decides how full the table may get before resizing.

If we set it larger, more elements accumulate and fewer slots remain empty before a resize, so the probability of hash collisions rises and the time cost of lookups increases.

If we set it smaller, the table resizes while still holding few elements and many empty slots, so hash collisions are less likely and lookups cheaper, but more space is needed to store the same number of elements, raising the space cost.

14. Do you understand the scaling mechanism?

HashMap is implemented with an array plus linked lists and red-black trees, but the length of the bucket array holding the key-value pairs is fixed once set, determined by the initialization parameters.

So, as more and more data is inserted and the load threshold is reached, the map must resize to store more data. A very important point about resizing is the JDK 1.8 optimization that removes the need to recompute each element's position from scratch.

Because HashMap's capacity is a power of 2 and doubling it keeps it a power of 2, each element either stays at its original index or moves to its original index plus the old capacity.

Look at this figure: n is the table length, figure (a) shows how keys key1 and key2 resolve to an index before resizing, and figure (b) shows how the same two keys resolve after resizing.

[Figure: index calculation for key1 and key2 before (a) and after (b) resizing]

After rehashing, since n has doubled, the mask n - 1 covers one more high bit, so the new index changes like this:

[Figure: the newly covered bit decides whether the new index equals the old index or old index + oldCap]

So during a resize you only need to look at the newly covered bit of the hash value: if it is 0, the index is unchanged; if it is 1, the new index is the original index + oldCap. See the schematic of resizing from 16 to 32:

[Figure: resizing from 16 to 32]

The main logic of node migration is as follows:

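Slightly simplified from the JDK 8 resize loop for one linked-list bucket (tree splitting elided), the migration looks like this; oldTab, newTab, j, and oldCap come from the surrounding resize() method:

Node<K,V> loHead = null, loTail = null;   // chain that stays at index j
Node<K,V> hiHead = null, hiTail = null;   // chain that moves to j + oldCap
for (Node<K,V> e = oldTab[j]; e != null; e = e.next) {
    if ((e.hash & oldCap) == 0) {         // newly exposed bit is 0
        if (loTail == null) loHead = e; else loTail.next = e;
        loTail = e;
    } else {                              // newly exposed bit is 1
        if (hiTail == null) hiHead = e; else hiTail.next = e;
        hiTail = e;
    }
}
if (loTail != null) { loTail.next = null; newTab[j] = loHead; }
if (hiTail != null) { hiTail.next = null; newTab[j + oldCap] = hiHead; }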

15. What major optimizations did JDK 1.8 make to HashMap? Why?

JDK 1.8's HashMap has five main optimizations:

  1. Data structure: array + linked list became array + linked list or red-black tree. Reason: when hash conflicts pile up in one bucket, a long linked list degrades lookups to O(n); converting it to a red-black tree reduces that to O(logn).
  2. Linked list insertion: head insertion became tail insertion. Simply put, when the array slot is already occupied, 1.7 put the new element into the slot and made the original node its successor, while 1.8 walks the linked list and appends the element at the tail. Reason: 1.7's head insertion reverses the linked list during resizing, which can produce a cycle in a multithreaded environment.
  3. Resize rehash: 1.7 rehashes every element in the old array to locate its position in the new array; 1.8 uses simpler logic that needs no recomputation through the hash function, since the new position is either unchanged or the old index + the new capacity offset. Reason: faster, more efficient resizing.
  4. Resize timing: 1.7 checks whether a resize is needed before inserting; 1.8 inserts first, then checks whether to resize.
  5. Hash function: 1.7 did four shifts and four XORs; JDK 1.8 does only one. Reason: four rounds add little marginal benefit, so one round improves efficiency.

16. Can you design and implement a HashMap yourself?

This question is often examined.

Don't panic. Most of us can't write the red-black tree version, but an array + linked list version is no big problem. For details, see: Handwritten HashMap, Kuaishou interviewer called the insider!

Overall design:

  • Hash function: hashCode() plus the division (modulo) method
  • Conflict resolution: chaining (linked list)
  • Resizing: rehash each node to obtain its new position

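Along those lines, here is a minimal array + linked list sketch. The class and names are illustrative, there is no treeification, and head insertion is used for simplicity:

import java.util.Objects;

public class SimpleHashMap<K, V> {
    // a bucket node: chaining resolves collisions
    static class Entry<K, V> {
        final K key; V value; Entry<K, V> next;
        Entry(K key, V value, Entry<K, V> next) { this.key = key; this.value = value; this.next = next; }
    }

    private static final int DEFAULT_CAPACITY = 16;
    private static final float LOAD_FACTOR = 0.75f;
    private Entry<K, V>[] table;
    private int size;

    @SuppressWarnings("unchecked")
    public SimpleHashMap() { table = new Entry[DEFAULT_CAPACITY]; }

    // hashCode() with the high bits mixed in, masked by the power-of-two length
    private int indexFor(Object key, int length) {
        if (key == null) return 0;
        int h = key.hashCode();
        return (h ^ (h >>> 16)) & (length - 1);
    }

    public V put(K key, V value) {
        if (size >= table.length * LOAD_FACTOR) resize();
        int i = indexFor(key, table.length);
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {          // same key: overwrite
                V old = e.value; e.value = value; return old;
            }
        }
        table[i] = new Entry<>(key, value, table[i]);  // head insertion for simplicity
        size++;
        return null;
    }

    public V get(Object key) {
        for (Entry<K, V> e = table[indexFor(key, table.length)]; e != null; e = e.next)
            if (Objects.equals(e.key, key)) return e.value;
        return null;
    }

    // double the table and rehash every node into its new slot
    @SuppressWarnings("unchecked")
    private void resize() {
        Entry<K, V>[] old = table;
        table = new Entry[old.length * 2];
        for (Entry<K, V> head : old) {
            for (Entry<K, V> e = head; e != null; ) {
                Entry<K, V> next = e.next;
                int i = indexFor(e.key, table.length);
                e.next = table[i];
                table[i] = e;
                e = next;
            }
        }
    }
}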

17. Is HashMap thread-safe? What problems can arise under multithreading?

HashMap is not thread-safe and these issues can occur:

  • Infinite loop during multithreaded resizing. The JDK 1.7 HashMap used head insertion, and under multithreading a resize could produce a circular linked list, causing an infinite loop. JDK 1.8 therefore switched to tail insertion, which preserves the original order of the list elements during resizing and avoids the circular-list problem.
  • Multithreaded put can lose elements. When multiple threads put at the same time and compute the same index, one key can overwrite another, losing an element. This issue exists in both JDK 1.7 and JDK 1.8.
  • Concurrent put and get can make get return null. If thread 1's put triggers a rehash because the element count exceeds the threshold while thread 2 executes get, get may return null. This issue exists in both JDK 1.7 and JDK 1.8.

18. How can HashMap's thread-safety problem be solved?

Java offers Hashtable, Collections.synchronizedMap, and ConcurrentHashMap as thread-safe Maps.

  • Hashtable adds the synchronized keyword directly to its operation methods, locking the entire table array; the granularity is coarse;
  • Collections.synchronizedMap uses an inner class of the Collections utility: the passed-in Map is wrapped in a SynchronizedMap object that defines an internal object lock and synchronizes on it inside every method;
  • ConcurrentHashMap uses segment locking in JDK 1.7 and CAS + synchronized in JDK 1.8.
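In code, the three options look like this (a trivial illustration):

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

public class SafeMaps {
    public static void main(String[] args) {
        Map<String, Integer> m1 = new Hashtable<>();                            // every method synchronized: whole-table lock
        Map<String, Integer> m2 = Collections.synchronizedMap(new HashMap<>()); // wrapper synchronizing on one internal mutex
        Map<String, Integer> m3 = new ConcurrentHashMap<>();                    // finer-grained: segments (1.7) or CAS + synchronized (1.8)
        m1.put("a", 1); m2.put("a", 1); m3.put("a", 1);
    }
}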

19. Can you talk specifically about the implementation of ConcurrentHashMap?

ConcurrentHashMap's thread safety is based on segment locks in JDK 1.7 and on CAS + synchronized in JDK 1.8.

1.7 Segment lock

Structurally, the 1.7 version of ConcurrentHashMap adopts a segment-locking mechanism. It contains a Segment array; Segment inherits from ReentrantLock, and each Segment contains an array of HashEntry. HashEntry itself is a linked-list node, holding a key and a value plus a pointer to the next node.

In effect, each Segment is a HashMap of its own. The default number of segments is 16, meaning 16 threads can write concurrently, and segments do not affect each other.


Put process

The whole process is very similar to HashMap's, except that it first locates the specific Segment and then operates under the ReentrantLock; the rest of the flow is basically the same as HashMap's.

  1. Compute the hash and locate the Segment, initializing it if empty
  2. Acquire the ReentrantLock; if acquisition fails, try spinning, and if the spin count exceeds its limit, fall back to blocking acquisition so the lock is eventually obtained
  3. Traverse the HashEntry chain just as in HashMap: if the key and hash match an existing node, replace its value directly; otherwise insert into the linked list in the same way

Get process

Get is also very simple: the key is hashed to locate the Segment, then the linked list is traversed to find the specific element. Note that value is volatile, so get does not need to lock.

1.8 CAS+synchronized

JDK 1.8 does not achieve thread safety by reworking the data structure; the structure is the same as HashMap's: array + linked list + red-black tree. The key to its thread safety lies in the put process.

Put process

  1. First compute the hash and loop over the node array; if the table is empty, initialize it with CAS + spin
tab = initTable();

node array initialization:

private final Node<K,V>[] initTable() {
        Node<K,V>[] tab; int sc;
        while ((tab = table) == null || tab.length == 0) {
            // another thread is already initializing or resizing
            if ((sc = sizeCtl) < 0)
                // wait
                Thread.yield(); // lost initialization race; just spin
            else if (U.compareAndSwapInt(this, SIZECTL, sc, -1)) {   // CAS to claim initialization
                try {
                    if ((tab = table) == null || tab.length == 0) {
                        int n = (sc > 0) ? sc : DEFAULT_CAPACITY;
                        @SuppressWarnings("unchecked")
                        Node<K,V>[] nt = (Node<K,V>[])new Node<?,?>[n];
                        table = tab = nt;
                        sc = n - (n >>> 2);
                    }
                } finally {
                    sizeCtl = sc;
                }
                break;
            }
        }
        return tab;
    }

2. If the target array slot is empty, write the data directly with CAS (spinning on failure)

static final <K,V> boolean casTabAt(Node<K,V>[] tab, int i,
                                        Node<K,V> c, Node<K,V> v) {
        return U.compareAndSwapObject(tab, ((long)i << ASHIFT) + ABASE, c, v);
    }
  3. If hash == MOVED, a resize is in progress, so help with the transfer
else if ((fh = f.hash) == MOVED)
                tab = helpTransfer(tab, f);
final Node<K,V>[] helpTransfer(Node<K,V>[] tab, Node<K,V> f) {
        Node<K,V>[] nextTab; int sc;
        if (tab != null && (f instanceof ForwardingNode) &&
            (nextTab = ((ForwardingNode<K,V>)f).nextTable) != null) {
            int rs = resizeStamp(tab.length);
            while (nextTab == nextTable && table == tab &&
                   (sc = sizeCtl) < 0) {
                if ((sc >>> RESIZE_STAMP_SHIFT) != rs || sc == rs + 1 ||
                    sc == rs + MAX_RESIZERS || transferIndex <= 0)
                    break;
                if (U.compareAndSwapInt(this, SIZECTL, sc, sc + 1)) {
                    transfer(tab, nextTab);
                    break;
                }
            }
            return nextTab;
        }
        return table;
    }
  4. Otherwise, write the data under a synchronized lock on the head node, again distinguishing linked list from red-black tree. Linked-list writes work the same way as in HashMap: a node with the same key and hash is overwritten, otherwise the new node is appended at the tail, and a list longer than 8 is converted to a red-black tree
synchronized (f){
     ……
 }

Get queries

Get is very simple and essentially the same as HashMap's: compute the slot from the key; if the key at that table slot matches, return it; if the node is a red-black tree node, search the tree; otherwise traverse the linked list.

20. Are HashMap internal nodes ordered?

HashMap is unordered: elements land at positions scattered by their hash values. If you want an ordered Map, use LinkedHashMap or TreeMap.

21. How does LinkedHashMap achieve order?

LinkedHashMap maintains a doubly linked list with head and tail nodes. Its node class Entry not only inherits the fields of HashMap's Node but also adds before and after pointers identifying the predecessor and successor nodes.


Ordering can be by insertion order or by access order.

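With accessOrder set to true, LinkedHashMap can even serve as a simple LRU cache. A small illustrative demo using the real three-argument constructor and the removeEldestEntry hook:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruDemo {
    public static void main(String[] args) {
        // third constructor argument true = order by access, not insertion
        LinkedHashMap<String, Integer> lru = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                return size() > 3;   // evict the least recently used beyond 3 entries
            }
        };
        lru.put("a", 1); lru.put("b", 2); lru.put("c", 3);
        lru.get("a");                       // touching "a" moves it to the tail
        lru.put("d", 4);                    // evicts "b", the least recently used
        System.out.println(lru.keySet());   // [c, a, d]
    }
}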

22. How does TreeMap achieve order?

TreeMap sorts by the natural order of the keys or by the order of a Comparator, implemented internally with a red-black tree. So either the key's class implements the Comparable interface, or you define a comparator implementing the Comparator interface and pass it to TreeMap for key comparison.

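A small illustrative example, sorting keys with a supplied Comparator:

import java.util.Comparator;
import java.util.TreeMap;

public class TreeMapDemo {
    public static void main(String[] args) {
        // keys sorted by the given Comparator instead of natural order
        TreeMap<String, Integer> map = new TreeMap<>(Comparator.reverseOrder());
        map.put("apple", 1);
        map.put("banana", 2);
        map.put("cherry", 3);
        System.out.println(map.keySet());   // [cherry, banana, apple]
    }
}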

23. What about the underlying implementation of HashSet?

The underlying HashSet is based on HashMap. (The HashSet source is very, very small: apart from clone(), writeObject(), and readObject(), which HashSet must implement itself, every other method is a direct call to a HashMap method.)

HashSet's add method calls HashMap's put directly, using the added element as the key and a new Object (PRESENT) as the value, and judges whether the insertion succeeded by whether the return value is null.

public boolean add(E e) {
        return map.put(e, PRESENT)==null;
    }

In HashMap's putVal method, after a series of checks, the net effect is: if the key already exists in the HashMap, the old value is returned, and HashSet's add method then returns false.

if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }