
Key points of Java's String: immutability, storage principles, string concatenation, and encoding

author:JAVA architecture

String s = new String("abc"): how many objects does this statement create? What does s == "abc" evaluate to? And what about s.substring(0,2).intern() == "ab"?

Can s.charAt(index) really return every character correctly?

Is direct concatenation such as "abc" + "gbn" + s really slower than using StringBuilder?

<h1 class="pgc-h-arrow-right" >Preface</h1>

Nice to meet you~

String objects in Java behave very differently from strings in C/C++, and the key difference is their immutability. Many questions follow from this design: Why are strings immutable? How are they stored under the hood? How should strings be manipulated for good performance? And so on. Some knowledge of character encoding also matters, since emoji are commonplace now.

This article is organized around immutability and covers:

The immutability of String objects

How the constant pool stores strings, and how the intern method works

How string concatenation works and how it is optimized

The difference between a code unit and a code point

Summary

So, let's get started

<h1 class="pgc-h-arrow-right" >Immutability</h1>

To understand the immutability of String, consider a simple example.

Calling string.replace("a","b") on the string "abcd" replaces a with b. The output shows that the original string has not changed in any way: the replace method constructs a new string, "bbcd", which is then assigned to the variable string1. That is the immutability of String.

Another example: suppose we want to change the last character of "abcd" to a. In C/C++ you can modify the last character in place, whereas in Java you have to create a new String object, "abca", because "abcd" itself is immutable and cannot be modified.

The value of a String object is immutable: no operation changes the string's value; each one constructs a new string instead.
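
The two cases above can be sketched as follows. This is a minimal illustration rather than the article's own listing: the variable names string and string1 come from the description, while the class ImmutabilityDemo and the helper withLastChar are hypothetical names introduced here.

```java
class ImmutabilityDemo {
    // Returns {original, replaced} to show that replace() leaves the receiver untouched.
    static String[] replaceDemo() {
        String string = "abcd";
        String string1 = string.replace("a", "b"); // builds a new String
        return new String[] {string, string1};
    }

    // "Changing" the last character means building a new String from the old one.
    // Assumes s is non-empty.
    static String withLastChar(String s, char c) {
        return s.substring(0, s.length() - 1) + c;
    }

    public static void main(String[] args) {
        String[] result = replaceDemo();
        System.out.println(result[0]);                 // abcd (unchanged)
        System.out.println(result[1]);                 // bbcd
        System.out.println(withLastChar("abcd", 'a')); // abca
    }
}
```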

It can be hard to see why Java was designed this way: doesn't it hurt performance? Think about how String is used day to day. More often than not we do not modify a string; we use it once and discard it, and then, most likely, the same string value is needed again. Take log printing, where a tag such as "MainActivity" is passed on every call.

We never need to change "MainActivity", but we use this string frequently. Java makes String immutable precisely to maintain data consistency, so that strings with the same literal value can refer to the same object. For example, suppose two variables s1 and s2 are both assigned the same literal.

s1 and s2 refer to the same String object. If String were mutable, this design could not work, because a change made through one reference would be visible through the other. Because String is immutable, we can reuse a String object that already exists without creating a new one.
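
A minimal sketch of the s1/s2 case, assuming both variables are assigned the literal "MainActivity" (the literal value and the class name LiteralSharingDemo are illustrative choices, not taken from the original listing). It compares references with ==, which is normally a mistake for strings but is exactly what shows object identity here.

```java
class LiteralSharingDemo {
    // Two identical literals resolve to the same pooled String object.
    static boolean sameLiteralSameObject() {
        String s1 = "MainActivity";
        String s2 = "MainActivity";
        return s1 == s2; // reference comparison, not equals()
    }

    public static void main(String[] args) {
        System.out.println(sameLiteralSameObject()); // true
    }
}
```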

Because reusing a String is far more common than changing one, Java makes String immutable. This keeps data consistent and lets strings with the same literal value reference the same String object, so existing String objects are reused.

Another point is made in the book Thinking in Java. Consider a method, called allCase here, that takes a String parameter.

The allCase method converts the incoming String to upper case and returns the result. The caller expects the String it passes in to serve only as information and does not want it to be modified, and the immutability of String matches this expectation exactly.

When a String object is passed as an argument, we do not want the object itself to change, and the immutability of String guarantees this.
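
The idea can be sketched as below. Only the method name allCase comes from the text; its body, the class name AllCaseDemo, and the sample value "howdy" are assumptions made for illustration.

```java
import java.util.Locale;

class AllCaseDemo {
    // Returns an upper-case copy; the argument itself cannot be modified.
    static String allCase(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String original = "howdy";
        String upper = allCase(original);
        System.out.println(original); // howdy (the caller's string is untouched)
        System.out.println(upper);    // HOWDY
    }
}
```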

<h1 class="pgc-h-arrow-right" >Storage principle</h1>

Because String objects are immutable, they are also stored differently from ordinary objects. Objects are created on the heap, and String objects are no exception; the difference is that strings can also be stored in the constant pool. String objects that live only in the heap are likely to be reclaimed at GC time, whereas String objects in the constant pool are not easily reclaimed, so they can be reused. In other words, the constant pool is what makes the reuse of String objects possible.

Objects in the constant pool are not easily garbage collected, so the String objects in it can persist and be reused.

There are two ways to create a String object in the constant pool: writing a string literal in double quotation marks, and calling the intern() method of a String object. Neither necessarily creates a new object: if an equal string already exists in the constant pool, a reference to that object is returned and the String object is reused. Every other way of creating a String creates the object in the heap.

For example, when we call new String() or an instance method such as string.substring(), a String object is created in the heap. When we write a string literal, such as String s = "abc", or call the intern() method of a String object, the object is placed in the constant pool.


Remember the questions at the beginning of the article?

String s = new String("abc"): how many objects does this code create? The literal "abc" puts an object in the constant pool (if an equal string is not already there), and new String() creates another object in the heap, so there are two in total.

The result of s == "abc" is false: the two sides are different objects, one in the heap and one in the constant pool.

In s.substring(0,2).intern() == "ab", the intern method places a String with the value "ab" in the constant pool, and the literal "ab" does not build a new String object but returns the one that already exists. So the result is true.
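
The answers above can be checked with a short program. This is a sketch with a hypothetical class name, ConstantPoolDemo; the expressions themselves are the ones from the questions.

```java
class ConstantPoolDemo {
    // new String(...) always creates a separate heap object, so this is false.
    static boolean heapVersusLiteral() {
        String s = new String("abc");
        return s == "abc";
    }

    // intern() returns the pooled "ab", the same object the literal "ab" refers to.
    static boolean internVersusLiteral() {
        String s = new String("abc");
        return s.substring(0, 2).intern() == "ab";
    }

    public static void main(String[] args) {
        System.out.println(heapVersusLiteral());   // false
        System.out.println(internVersusLiteral()); // true
    }
}
```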

Only two operations place String objects in the constant pool: writing a string literal in double quotation marks and calling intern() on a String object. Every other approach creates the object in the heap. Before a String is added to the constant pool, the pool is checked for an equal String; if one exists, a reference to it is returned and no new object is created.

The intern method has another subtlety: its behavior differs between JDK versions. In JDK 6 and earlier, the constant pool lived in the method area, which was implemented as the permanent generation (PermGen), a memory region separate from the heap. Placing a string in the constant pool therefore required a deep copy, that is, copying the whole object to create a new one in the pool.


The permanent generation has a serious drawback: it is prone to OOM (OutOfMemoryError). It has a fixed and fairly small upper limit, so OOM can easily occur when a program calls intern heavily. In JDK 7 the string constant pool was moved out of the permanent generation into the heap; in JDK 8 the permanent generation itself was removed and class metadata moved to Metaspace in native memory. Since JDK 7, placing a string in the constant pool only needs a shallow copy: instead of copying the whole object, the pool stores a reference to the existing heap object, which avoids creating the object twice.


Consider code that builds a string in the heap at run time, interns it, and then compares the heap reference with the pooled one using ==.

In JDK 6 and earlier, two different objects are created and the output is false; from JDK 7 on, no new object is created in the constant pool and both references point to the same object, so the output is true.

In JDK 6 and earlier, intern makes a deep copy of the object, while JDK 7 and later make a shallow copy, which allows the String object in the heap to be reused.
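
The JDK 7+ behavior can be observed with a sketch like the one below. It is not the article's original listing: the class name InternVersionDemo and the way the string is built are assumptions. The value is assembled at run time with a timestamp so that it is very unlikely to be in the pool already.

```java
class InternVersionDemo {
    // On HotSpot JDK 7 and later, interning a string that is not yet pooled stores a
    // reference to the existing heap object, so intern() returns that same object.
    // JDK 6 and earlier would copy the string into PermGen and return the copy.
    static boolean internReusesHeapObject() {
        String s = new StringBuilder("intern-demo-").append(System.nanoTime()).toString();
        return s.intern() == s;
    }

    public static void main(String[] args) {
        System.out.println(internReusesHeapObject()); // expected: true on JDK 7+
    }
}
```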

From the analysis above, strings are truly reused when they are created directly as literals in double quotation marks. The intern method can also return a reference from the constant pool, but it requires a String object to exist in the heap first. We can therefore conclude:

Prefer string literals in double quotation marks where possible; if a string built at run time needs to be reused frequently, call intern to store it in the constant pool.

<h1 class="pgc-h-arrow-right" >String concatenation</h1>

Concatenation is among the most common string operations. Because String objects are immutable, creating a new String object for every concatenation is too expensive. Java therefore provides two classes, StringBuffer and StringBuilder, which can assemble and modify strings without creating a new String object each time. Consider a StringBuilder that is appended to, inserted into, and deleted from.

Appending, inserting, and deleting can all be done quickly. It is therefore more efficient to use StringBuilder when a string is built up through modification, concatenation, and similar operations. StringBuffer and StringBuilder have the same interface, but StringBuffer marks its methods with the synchronized keyword, which ensures thread safety at a corresponding performance cost. StringBuilder is recommended in single-threaded environments.

StringBuilder and StringBuffer improve performance when a string is built up through concatenation, modification, and similar operations; StringBuilder is the better choice in a single-threaded environment.
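
A minimal sketch of these operations on a single StringBuilder follows. The class name StringBuilderDemo and the sample values are illustrative; the methods used (append, insert, delete, setCharAt) are standard StringBuilder methods.

```java
class StringBuilderDemo {
    // All edits happen in the same buffer; only toString() creates a String.
    static String build() {
        StringBuilder sb = new StringBuilder("abc");
        sb.append("def");     // abcdef
        sb.insert(0, "xyz");  // xyzabcdef
        sb.delete(3, 6);      // xyzdef (removes indices 3, 4 and 5)
        sb.setCharAt(0, 'X'); // Xyzdef
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build()); // Xyzdef
    }
}
```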

In general, we use + to concatenate strings. Java overloads the + operator for string concatenation, and the compiler applies a series of optimizations to it. Consider a string s1 assigned from "ab" + "cd" + "fg" and a string s2 assigned from "hello" + s1.

For s1, the compiler folds "ab" + "cd" + "fg" directly into "abcdfg", so the statement is equivalent to String s1 = "abcdfg";. This optimization removes the cost of concatenation at run time, which makes it even more efficient than using StringBuilder.
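
The folding can be observed through reference equality. The sketch below uses a hypothetical class name, ConstantFoldingDemo; the expressions follow the s1 and s2 described in the text.

```java
class ConstantFoldingDemo {
    // "ab" + "cd" + "fg" is a compile-time constant, so it is the pooled "abcdfg".
    static boolean foldedAtCompileTime() {
        String s1 = "ab" + "cd" + "fg";
        return s1 == "abcdfg";
    }

    // s1 is an ordinary (non-final) variable, so "hello" + s1 is evaluated at run time
    // and yields a new heap object, distinct from the pooled literal.
    static boolean runtimeConcatIsPooled() {
        String s1 = "ab" + "cd" + "fg";
        String s2 = "hello" + s1;
        return s2 == "helloabcdfg";
    }

    public static void main(String[] args) {
        System.out.println(foldedAtCompileTime());   // true
        System.out.println(runtimeConcatIsPooled()); // false
    }
}
```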

For s2, the compiler automatically creates a StringBuilder to build the string (this is what javac generates up to JDK 8; since JDK 9 it emits an invokedynamic call to StringConcatFactory instead, with the same result). The statement is equivalent to the following code:

StringBuilder sb = new StringBuilder();
sb.append("hello");
sb.append(s1);
String s2 = sb.toString();

So does this mean that we no longer need to use StringBuilder explicitly, since the compiler will optimize for us anyway? Of course not. Look at the code below:

String s = "a";
for (int i = 0; i < 100; i++) {
    s += i;
}

The loop runs 100 times and a new StringBuilder object is created on each iteration, 100 in total, which is clearly wasteful. Here we need to create the StringBuilder explicitly:

StringBuilder sb = new StringBuilder("a");
for (int i = 0; i < 100; i++) {
    sb.append(i);
}
String s = sb.toString();

With a single StringBuilder object, performance improves greatly.

String s3 = s2 + object;

String concatenation also supports appending an ordinary object directly, in which case the object's toString method is called and the returned string is concatenated. toString is a method of the Object class; if a subclass does not override it, Object's implementation is used, which returns the class name, an @ sign, and the object's hash code in hexadecimal. This may look harmless, but there is a big pitfall: never concatenate this with + inside the toString method itself. Consider the following code:

@Override
public String toString() {
    return this + "abc";
}

Concatenating this here calls the toString method of this again, resulting in infinite recursion.
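
A safe toString builds its result from fields rather than from this. The sketch below is illustrative only: the Point class, the describe helper, and the class name ToStringConcatDemo are hypothetical.

```java
class ToStringConcatDemo {
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        // Concatenates fields only; never `this`, which would recurse.
        @Override
        public String toString() {
            return "Point(" + x + ", " + y + ")";
        }
    }

    // Concatenating an object calls its toString(); a null reference becomes "null".
    static String describe(Object o) {
        return "value: " + o;
    }

    public static void main(String[] args) {
        System.out.println(describe(new Point(1, 2))); // value: Point(1, 2)
        System.out.println(describe(null));            // value: null
    }
}
```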

Java optimizes string concatenation with + in three ways:

Ordinary objects can be concatenated directly

Concatenated literals are merged into a single literal at compile time

Other concatenations are compiled to use StringBuilder

At the same time, these optimizations have limits, and we need to choose the right concatenation approach for each scenario to get good performance.

<h1 class="pgc-h-arrow-right" >Encoding problem</h1>

In Java, a char can generally store one character, and the size of a char is 16 bits. As character sets have grown, 16 bits are no longer enough, so some characters, such as emoji, are stored using two chars, that is, 32 bits. A 16-bit char is called a code unit, and a character is called a code point; a code point may occupy one code unit or two.

For a string, String.length() returns the number of code units, and String.charAt() returns the code unit at the given index. This is not a problem for ordinary text, but it becomes a big problem once such special characters are allowed. To get the true number of code points, call String.codePointCount; to get the code point at a given position, call String.codePointAt (its index is still counted in code units). These methods handle the extended character set correctly.

A character is a code point, and a char is a code unit. A code point may occupy one or two code units. If special characters are allowed, strings must be processed by code point.
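
The difference shows up as soon as a string contains an emoji. In the sketch below, the class name CodePointDemo and the sample text are assumptions; the emoji U+1F600 is written as its surrogate pair so the source file stays ASCII.

```java
class CodePointDemo {
    // "a", the emoji U+1F600 (stored as two chars), then "b".
    static final String TEXT = "a\uD83D\uDE00b";

    static int codeUnits() {
        return TEXT.length(); // counts chars: 4
    }

    static int codePoints() {
        return TEXT.codePointCount(0, TEXT.length()); // counts characters: 3
    }

    // charAt(1) returns only the first half of the emoji (a high surrogate).
    static boolean charAtReturnsHalf() {
        return Character.isHighSurrogate(TEXT.charAt(1));
    }

    // codePointAt(1) reads both halves and returns the full code point.
    static int secondCodePoint() {
        return TEXT.codePointAt(1);
    }

    public static void main(String[] args) {
        System.out.println(codeUnits());                          // 4
        System.out.println(codePoints());                         // 3
        System.out.println(charAtReturnsHalf());                  // true
        System.out.println(Integer.toHexString(secondCodePoint())); // 1f600
        // Iterate by code point instead of by char.
        TEXT.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```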

<h1 class="pgc-h-arrow-right" >Summary</h1>

At this point, the key questions about String have been analyzed, and readers should now know the answers to the questions at the beginning of the article. These questions come up often in interviews and are the core of String. Regular expressions, input and output, the common APIs, and so on are also important String-related topics, which interested readers can study on their own.

Hope this article is helpful to you.

Original link: http://www.cnblogs.com/huan89/p/14159732.html
