2017-10-12

Java String

最常见的字符串

代码版本jdk1.8.0_101

String应该是Java里最常用的类了。看似普通，但其实包含了很多设计上的独特考虑。

String的底层实现是字符数组

String.java@114

1	private final char value[];

不可变性(Immutable)

String类是final的，不可子类化
底层数组是final的，一旦赋值后不可更改引用
底层数组是private的，没有提供get方法
一些看似是get的方法都是数组拷贝

String.java

1
2
3

void getChars(char dst[], int dstBegin) {
    System.arraycopy(value, 0, dst, dstBegin, value.length);
}

encode方法最终还是拷贝

String.java

1
2
3

public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}

String.java

public char[] toCharArray() {
    // Cannot use Arrays.copyOf because of class initialization order issues
    char result[] = new char[value.length];
    System.arraycopy(value, 0, result, 0, value.length);
    return result;
}

构造函数特殊处理
对于字符串参数，因为不可变，直接赋值即可

String.java

public String(String original) {
    this.value = original.value;
    this.hash = original.hash;
}

对于数组参数，采用拷贝方式,自赋值和外界断绝关系

String.java

1
2
3

public String(char value[]) {
    this.value = Arrays.copyOf(value, value.length);
}

还可以取部分数组进行构造，同样是拷贝

String.java

public String(char value[], int offset, int count) {
    //...
    this.value = Arrays.copyOfRange(value, offset, offset+count);
}

操作函数不会修改底层数组，返回的是新字符串
最后调用构造函数构建新字符串。为了避免无谓的拷贝，特意判定全长直接返回。

String.java

public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    if (endIndex > value.length) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    int subLen = endIndex - beginIndex;
    if (subLen < 0) {
        throw new StringIndexOutOfBoundsException(subLen);
    }
    return ((beginIndex == 0) && (endIndex == value.length)) ? this
            : new String(value, beginIndex, subLen);
}

HashCode

底层数组值决定了字符串的hash值
算法是s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
因为字符串是不可变的，所以hash值也是确定不变的，因此计算后会保存到成员变量中
计算还采用了延迟计算策略，第一次发现没有hash值时才会计算

String.java

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

都选择31的原因一方面是因为它是质数，另一方面num*31 = num*32-num = num << 5 - num,可以将乘操作优化成移位和减操作,计算方便

Intern

不可变特性使得没必要同时存在两个相同的字符串。缓存对于不可变的对象是一种很好的策略，可以节约内存，因而String类维护了常量池。字面量和字面表达式会直接加入常量池，而变量可以通过intern方法加入常量池。intern方法的含义是查询常量池，如果在则返回池内引用，如果不在则加入池中。

intern是一个native的方法

String.java

1	public native String intern();

对象a和b分别是常量和常量表达式，会加入常量池，因此引用相同。
对象c创建了新对象，因而和a并不相同。对c调用intern后返回池中引用，因此和a相同。

Sample

String a = "123";
String b = "12"+"3";
System.out.println(a==b); //true 常量和常量
String c = new String("123"); //强制堆上新建分配
System.out.println(a==c); //false  常量和堆
System.out.println(a==c.intern()); //true  常量和常量

String类由于不可变特性，无需同步，多线程可直接使用。而常量池也并非线程隔离，因此也可以在多线程间共享。

然而常量池并非百利无害，使用还需慎重，适合缓存频繁使用的字符串。

Java6以前常量池维护在PermGen，加多了会造成空间不足，Java7之后改由Heap维护
Java7之后常量池和堆在一个空间里，因此intern可以直接把堆中对象直接加入常量池，而之前intern只是建立副本放入常量池，堆中对象和常量池对象是两个

String.java

1
2
3

String a = "a";
String aa = new String(a + a);
System.out.println(aa == aa.intern()); //java 6 false java7 true  堆和新常量

Java7之后可以用-XX:StringTableSize设定大小，Java7默认大小1009,Java8提高默认值到万级

常量的长度限制

String的length方法返回int类型，理论上最大长度能达到整型最大，但是java spec定义所有字面量长度采取short类型，所以字面量最大只能65536长度。由于String还可以取null，占用2个字节，即单一字面量最大只能65534长度。

indexOf

采用简单直观的方法，找到首字符后开启一轮匹配，复杂度O(m*n)
之所以没采用KMP等O(m+n)算法可能是因为KMP需要申请额外空间并预处理，作为库方法不一定有利

String.java

public int indexOf(String str) {
    return indexOf(str, 0);
}
public int indexOf(String str, int fromIndex) {
    return indexOf(value, 0, value.length, str.value, 0, str.value.length, fromIndex);
}

static int indexOf(char[] source, int sourceOffset, int sourceCount, char[] target, int targetOffset, int targetCount, int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    //目标字符串的首字符
    char first = target[targetOffset];
    //能满足目标长度的最大位置
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        //查找首字符
        if (source[i] != first) {
            while (++i <= max && source[i] != first);
        }

        //首字符匹配，循环检测后续匹配
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++);

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}

replace

replace表面上是字符串替换字符串，实际上是基于正则，把字符串当成固定的模式
效率不是很高，可以采用apache的StringUtils替换

String.java

public String replace(CharSequence target, CharSequence replacement) {
    return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
            this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}

replaceAll直接接收正则表达式

String.java

1
2
3

public String replaceAll(String regex, String replacement) {
    return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}

replace并非只替换一次，而是和replaceAll一样全部替换，区别是replace传入的是纯文本，而replaceAll传入正则表达式

substring变化

substring采用[begin, end)方式
Java6版本，subString为了速度，子串依然使用原始字符串的数组值，只不过设置了不同的起点和长度
表面上看子串没有新数组生成，提升了效率，但是如果字符串很长，那么长字符串将不能被垃圾回收

String.java

// Package private constructor which shares value array for speed.
String(int offset, int count, char value[]) {
    this.value = value;
    this.offset = offset;
    this.count = count;
}

public String substring(int beginIndex, int endIndex) {
    //...
    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);
}

Java7以后，真正复制了子串，因此也不再需要offset和count成员变量

String.java

public String(char value[], int offset, int count) {
    //...
    this.value = Arrays.copyOfRange(value, offset, offset+count);
}
public String substring(int beginIndex, int endIndex) {
    //...
    int subLen = endIndex - beginIndex;
    //...
    return ((beginIndex == 0) && (endIndex == value.length)) ? this
            : new String(value, beginIndex, subLen);
}

split变化

Java6版本，split直接应用了正则表达式，性能比较慢

String.java

1
2
3

public String[] split(String regex, int limit) {
    return Pattern.compile(regex).split(this, limit);
}

Java7以后，针对单一分割符做了优化，如果是单一字符或者单个转义的字符，采用indexOf+subString的方式

String.java

public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 && ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 && regex.charAt(0) == '\\' && (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 && ((ch-'A')|('Z'-ch)) < 0)) && (ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE))
    {
        //...
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}

附注

String不可变，非常适合作为HashMap的key
String不可变，多线程直接共享无需保护
String不可变，垃圾回收之前会在内存驻留，涉及私密信息用char[]存储更好，使用后直接赋新值清除
String对于byte[]没有valueOf方法，但是可以通过构造函数创建，还可以指定编码

一分也是爱~

This site by Linest is licensed under a Creative Commons BY-NC-ND 4.0 International License.
由Linest创作并维护的博客采用创作共用保留署名-非商业-禁止演绎4.0国际许可证。

本文永久链接：http://linest.github.io/2017/10/12/java-string/