String is one of the most common types of data in programming languages. A Java string is a sequence of characters, such as “abc”. Strings are constant: their values cannot be changed after they are created.
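
To illustrate immutability, here is a minimal sketch. The class name ImmutableDemo and the helper shout are made up for this example; only toUpperCase comes from the JDK.

```java
import java.util.Locale;

// Minimal demo: String methods return new objects; the original never changes.
final class ImmutableDemo {
    // Hypothetical helper: returns an upper-cased copy of s.
    static String shout(String s) {
        // toUpperCase builds a new String and leaves s untouched.
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String original = "abc";
        String upper = shout(original);
        System.out.println(original); // abc
        System.out.println(upper);    // ABC
    }
}
```

Every "modifying" method on String works this way: it returns a new object rather than altering the receiver.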

This post introduces the basics of the Java String class, including methods for examining individual characters of the sequence, comparing strings, searching strings, extracting substrings, and so on.

The String class lives in the java.lang package (java.lang.String). All classes in java.lang are imported by default, so you do not need to import java.lang.*.

All code in this post is based on JDK 10.

Member Variables

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence {
    @Stable
    private final byte[] value;

    private final byte coder;
    private int hash; 

    static final boolean COMPACT_STRINGS;
    static {
        COMPACT_STRINGS = true;
    }
}
  • value stores the characters of the string as bytes. It is marked with @Stable so that the VM can trust the contents of the array; this is safe because value is never null.
  • coder is the identifier of the encoding used to encode the bytes in value. The supported values in this implementation are LATIN1 and UTF16.
  • hash caches the hash code for the string.
  • COMPACT_STRINGS indicates whether string compaction is enabled. If compaction is disabled (false), the bytes in value are always encoded in UTF16.

For simplicity, we will consider LATIN1 only.
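
To make the role of coder concrete, here is a rough sketch of how the encoding could be chosen for a given string. CoderSketch and coderFor are hypothetical names for this post, not JDK methods; the JDK makes this decision inside its own internal routines. The constant values 0 and 1 are assumed to match the JDK's LATIN1 and UTF16.

```java
// Sketch: a string fits LATIN1 only if every char is in the range 0..255.
final class CoderSketch {
    static final byte LATIN1 = 0; // assumed to match the JDK constant
    static final byte UTF16 = 1;  // assumed to match the JDK constant

    // Hypothetical helper: picks the coder a compact string could use for s.
    static byte coderFor(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return UTF16; // one wide char forces two bytes per char
            }
        }
        return LATIN1; // one byte per char is enough
    }

    public static void main(String[] args) {
        System.out.println(coderFor("abc"));          // 0
        System.out.println(coderFor("caf\u00e9"));    // 0
        System.out.println(coderFor("\u4f60\u597d")); // 1
    }
}
```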

Member Functions

Constructor

// String.java

/*
 Initializes a newly created String object 
 so that it represents the same sequence of characters as the argument
 */
public String(String original) {
    this.value = original.value;
    this.coder = original.coder;
    this.hash = original.hash;
}

/*
    Constructs a new String by decoding the specified subarray of
    bytes using the specified charset. 
 */
public String(byte bytes[], int offset, int length, String charsetName)
        throws UnsupportedEncodingException {
    if (charsetName == null)
        throw new NullPointerException("charsetName");
    checkBoundsOffCount(offset, length, bytes.length);
    StringCoding.Result ret =
        StringCoding.decode(charsetName, bytes, offset, length);
    this.value = ret.value;
    this.coder = ret.coder;
}

// StringCoding.java
static Result decode(String charsetName, byte[] ba, int off, int len)
        throws UnsupportedEncodingException {
    Charset cs = lookupCharset(charsetName);
    if (cs == ISO_8859_1) {
        return decodeLatin1(ba, off, len);
    }

    // ...
}


// If a String is constructed from a byte[], Arrays.copyOfRange copies the
// requested range into a new array, so the String never shares the caller's array.
private static Result decodeLatin1(byte[] ba, int off, int len) {
    Result result = resultCached.get();
    if (COMPACT_STRINGS) {
        return result.with(Arrays.copyOfRange(ba, off, off + len), LATIN1);
    } else {
        return result.with(StringLatin1.inflate(ba, off, len), UTF16);
    }
}
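
The defensive copy can be observed from outside the class. The following is a small sketch, assuming the ISO-8859-1 charset name; ByteCtorDemo and decode are names invented for this example.

```java
import java.io.UnsupportedEncodingException;

// Demo: a String built from a byte[] does not change when the array changes later.
final class ByteCtorDemo {
    // Hypothetical helper wrapping the constructor discussed above.
    static String decode(byte[] bytes, int offset, int length) {
        try {
            return new String(bytes, offset, length, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is a standard charset that every Java platform must support.
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {'h', 'e', 'l', 'l', 'o'};
        String s = decode(raw, 1, 3);
        raw[1] = 'X';          // mutate the source array afterwards
        System.out.println(s); // ell
    }
}
```

Because the constructor copies the range, later writes to raw cannot break the immutability of s.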

length()

// String.java
byte coder() {
    return COMPACT_STRINGS ? coder : UTF16;
}

// For LATIN1, coder is 0, so the shift is a no-op.
// For UTF16, coder is 1, so the byte count is halved.
public int length() {
    return value.length >> coder();
}
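
The shift can be tried on real byte arrays. This is only a sketch: LengthSketch and lengthOf are hypothetical names, and getBytes stands in for the internal value array, which cannot be read directly.

```java
import java.nio.charset.StandardCharsets;

// Sketch: character count = byte count >> coder (0 for LATIN1, 1 for UTF16).
final class LengthSketch {
    // Hypothetical helper mirroring "value.length >> coder()".
    static int lengthOf(byte[] value, int coder) {
        return value.length >> coder;
    }

    public static void main(String[] args) {
        byte[] latin1 = "hello".getBytes(StandardCharsets.ISO_8859_1); // 5 bytes
        byte[] utf16 = "hello".getBytes(StandardCharsets.UTF_16LE);    // 10 bytes
        System.out.println(lengthOf(latin1, 0)); // 5
        System.out.println(lengthOf(utf16, 1));  // 5
    }
}
```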

isEmpty()

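isEmpty() returns true if, and only if, length() is 0. It checks the length of value directly, since an empty array means an empty string under either coder.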

// String.java
public boolean isEmpty() {
    return value.length == 0;
}

charAt(int)

charAt(int) returns the char value at the specified index.

// String.java
public char charAt(int index) {
    if (isLatin1()) {
        return StringLatin1.charAt(value, index);
    } else {
        return StringUTF16.charAt(value, index);
    }
}

// StringLatin1.java
public static char charAt(byte[] value, int index) {
    if (index < 0 || index >= value.length) {
        throw new StringIndexOutOfBoundsException(index);
    }
    return (char)(value[index] & 0xff);
}

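StringLatin1.charAt checks the bounds, then returns the byte stored at the given index of value. The & 0xff mask treats the byte as unsigned before it is converted to a char.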
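
The mask matters because Java bytes are signed. A small sketch (UnsignedByteDemo and toChar are names made up for this example) shows what happens with and without it:

```java
// Demo: why a Latin-1 byte is masked with 0xff before the cast to char.
final class UnsignedByteDemo {
    // Same conversion as in StringLatin1.charAt.
    static char toChar(byte b) {
        return (char) (b & 0xff);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE9;                // e-acute in Latin-1, stored as -23
        System.out.println((int) toChar(b)); // 233
        System.out.println((int) (char) b);  // 65513: sign extension without the mask
    }
}
```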

equals(Object)

equals(Object) compares this string to the specified object.

// String.java
public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String aString = (String)anObject;
        if (coder() == aString.coder()) {
            return isLatin1() ? StringLatin1.equals(value, aString.value)
                              : StringUTF16.equals(value, aString.value);
        }
    }
    return false;
}

// StringLatin1.java
public static boolean equals(byte[] value, byte[] other) {
    if (value.length == other.length) {
        for (int i = 0; i < value.length; i++) {
            if (value[i] != other[i]) {
                return false;
            }
        }
        return true;
    }
    return false;
}

Apart from the shortcut for identical references, a String is equal to another object when:

  • the other object is an instance of String,
  • both strings use the same coder,
  • and the two value arrays have the same length and hold equal bytes at every index.
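
A short usage example contrasts equals with reference comparison. EqualsDemo is just a name chosen for this sketch.

```java
// Demo: == compares references, equals compares contents.
final class EqualsDemo {
    public static void main(String[] args) {
        String a = "abc";
        String b = new String("abc"); // distinct object, same contents
        Object notAString = Integer.valueOf(1);

        System.out.println(a == b);               // false
        System.out.println(a.equals(b));          // true
        System.out.println(a.equals("abd"));      // false
        System.out.println(a.equals(notAString)); // false: not a String
        System.out.println(a.equals(null));       // false
    }
}
```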

startsWith(String, int)

startsWith(String, int) tests if the substring of this string beginning at the specified index starts with the specified prefix.

// String.java
public boolean startsWith(String prefix, int toffset) {
    // Note: toffset might be near -1>>>1.
    if (toffset < 0 || toffset > length() - prefix.length()) {
        return false;
    }
    byte ta[] = value;
    byte pa[] = prefix.value;
    int po = 0;
    int pc = pa.length;
    if (coder() == prefix.coder()) {
        int to = isLatin1() ? toffset : toffset << 1;
        while (po < pc) {
            if (ta[to++] != pa[po++]) {
                return false;
            }
        }
    } else {
        if (isLatin1()) {  // && pcoder == UTF16
            return false;
        }
        // coder == UTF16 && pcoder == LATIN1
        while (po < pc) {
            if (StringUTF16.getChar(ta, toffset++) != (pa[po++] & 0xff)) {
                return false;
            }
        }
    }
    return true;
}

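startsWith first rejects offsets that are negative or leave too little room for the prefix. If both strings share the same coder, it compares the two arrays byte by byte. If this string is LATIN1 and the prefix is UTF16, the result is false, because a UTF16 prefix contains at least one character that LATIN1 cannot represent. If this string is UTF16 and the prefix is LATIN1, it compares them char by char.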
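
A few calls show the offset checks in action. StartsWithDemo is a name invented for this sketch.

```java
// Demo: startsWith(prefix, toffset) with valid and invalid offsets.
final class StartsWithDemo {
    public static void main(String[] args) {
        String s = "hello world"; // length 11
        System.out.println(s.startsWith("world", 6));  // true
        System.out.println(s.startsWith("world", 7));  // false: only 4 chars remain
        System.out.println(s.startsWith("hello", -1)); // false: negative offset
        System.out.println(s.startsWith("", 11));      // true: empty prefix at the end
    }
}
```

Note that an invalid offset yields false rather than an exception.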

indexOf(int, int)

indexOf(int, int) returns the index within this string of the first occurrence of the specified character, starting the search at the specified index.

// String.java

public int indexOf(int ch, int fromIndex) {
    return isLatin1() ? StringLatin1.indexOf(value, ch, fromIndex)
                      : StringUTF16.indexOf(value, ch, fromIndex);
}

// StringLatin1.java
public static int indexOf(byte[] value, int ch, int fromIndex) {
    if (!canEncode(ch)) {
        return -1;
    }
    int max = value.length;
    if (fromIndex < 0) {
        fromIndex = 0;
    } else if (fromIndex >= max) {
        // Note: fromIndex might be near -1>>>1.
        return -1;
    }
    byte c = (byte)ch;
    for (int i = fromIndex; i < max; i++) {
        if (value[i] == c) {
            return i;
        }
    }
    return -1;
}

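StringLatin1.indexOf returns -1 immediately if ch cannot be encoded in LATIN1. Otherwise it clamps a negative fromIndex to 0, scans value from fromIndex to the end, and returns the first index whose byte equals ch, or -1 if there is none.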
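
The edge cases are easy to check with a short example. IndexOfCharDemo is a name made up for this sketch.

```java
// Demo: indexOf(ch, fromIndex) on a Latin-1 string.
final class IndexOfCharDemo {
    public static void main(String[] args) {
        String s = "banana"; // b0 a1 n2 a3 n4 a5
        System.out.println(s.indexOf('a', 2));      // 3
        System.out.println(s.indexOf('a', -5));     // 1: negative fromIndex is treated as 0
        System.out.println(s.indexOf('a', 6));      // -1: fromIndex is past the end
        System.out.println(s.indexOf('\u4f60', 0)); // -1: char is not in the string
    }
}
```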

indexOf(String)

indexOf(String) returns the index within this string of the first occurrence of the specified substring.

// String.java

public int indexOf(String str) {
    if (coder() == str.coder()) {
        return isLatin1() ? StringLatin1.indexOf(value, str.value)
                          : StringUTF16.indexOf(value, str.value);
    }
    if (coder() == LATIN1) {  // str.coder == UTF16
        return -1;
    }
    return StringUTF16.indexOfLatin1(value, str.value);
}

// StringLatin1.java
public static int indexOf(byte[] value, int valueCount, byte[] str, int strCount, int fromIndex) {
    byte first = str[0];
    int max = (valueCount - strCount);
    for (int i = fromIndex; i <= max; i++) {
        // Look for first character.
        if (value[i] != first) {
            while (++i <= max && value[i] != first);
        }
        // Found first character, now look at the rest of value
        if (i <= max) {
            int j = i + 1;
            int end = j + strCount - 1;
            for (int k = 1; j < end && value[j] == str[k]; j++, k++);
            if (j == end) {
                // Found whole string.
                return i;
            }
        }
    }
    return -1;
}

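The two-argument indexOf(byte[], byte[]) called from String.java handles empty inputs and then delegates to the five-argument version shown here. Interestingly, indexOf(byte[], int, byte[], int, int) uses a brute-force search: it looks for the first byte of str, then compares the remaining bytes one by one, and moves on to the next position if they do not match.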
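
The same brute-force idea can be written in a more compact form. The sketch below is a simplified stand-in, not the JDK code: NaiveSearch and its indexOf are hypothetical, always start at index 0, and skip the "find the first byte" fast path.

```java
import java.nio.charset.StandardCharsets;

// Sketch: brute-force substring search over byte arrays.
final class NaiveSearch {
    // Returns the first index at which str occurs in value, or -1.
    static int indexOf(byte[] value, byte[] str) {
        int max = value.length - str.length; // last possible start position
        for (int i = 0; i <= max; i++) {
            int k = 0;
            // Compare str against value starting at position i.
            while (k < str.length && value[i + k] == str[k]) {
                k++;
            }
            if (k == str.length) {
                return i; // every byte matched
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] text = "mississippi".getBytes(StandardCharsets.ISO_8859_1);
        byte[] word = "issip".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(indexOf(text, word));            // 4
        System.out.println("mississippi".indexOf("issip")); // 4
    }
}
```

In the worst case this compares every byte of str at every position of value, so the cost grows with the product of the two lengths.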

substring(int, int)

substring(int, int) returns a string that is a substring of this string.

// String.java
public String substring(int beginIndex, int endIndex) {
    int length = length();
    checkBoundsBeginEnd(beginIndex, endIndex, length);
    int subLen = endIndex - beginIndex;
    if (beginIndex == 0 && endIndex == length) {
        return this;
    }
    return isLatin1() ? StringLatin1.newString(value, beginIndex, subLen)
                      : StringUTF16.newString(value, beginIndex, subLen);
}

// StringLatin1.java
public static String newString(byte[] val, int index, int len) {
    return new String(Arrays.copyOfRange(val, index, index + len),
                      LATIN1);
}

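If the requested range covers the whole string, substring returns this. Otherwise Arrays.copyOfRange is used again to copy the requested bytes, and a new String is allocated around the copy.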
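
Both branches can be seen in a short example. SubstringDemo is a name chosen for this sketch, and the reference comparison reflects the JDK 10 code above rather than a documented guarantee.

```java
// Demo: substring over the full range versus a proper sub-range.
final class SubstringDemo {
    public static void main(String[] args) {
        String s = "abcdef";
        String whole = s.substring(0, s.length());
        String part = s.substring(1, 3);

        System.out.println(whole == s); // true with the JDK 10 code shown above
        System.out.println(part);       // bc
        System.out.println(s);          // abcdef, the original is unchanged
    }
}
```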

concat(String)

concat(String) concatenates the specified string to the end of this string.

// String.java

public String concat(String str) {
    int olen = str.length();
    if (olen == 0) {
        return this;
    }
    if (coder() == str.coder()) {
        byte[] val = this.value;
        byte[] oval = str.value;
        int len = val.length + oval.length;
        byte[] buf = Arrays.copyOf(val, len);
        System.arraycopy(oval, 0, buf, val.length, oval.length);
        return new String(buf, coder);
    }
    int len = length();
    byte[] buf = StringUTF16.newBytesFor(len + olen);
    getBytes(buf, 0, UTF16);
    str.getBytes(buf, len, UTF16);
    return new String(buf, UTF16);
}

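If the argument is empty, concat returns this. When both strings share the same coder, Arrays.copyOf creates a buffer large enough for both and copies this string's bytes into it, then System.arraycopy appends the bytes of the argument. Otherwise both strings are written into a new UTF16 buffer.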
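
The same-coder branch can be reproduced on plain byte arrays. This is a sketch only: ConcatSketch and concatBytes are hypothetical names, and getBytes stands in for the internal value arrays.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: joining two Latin-1 byte arrays the way the same-coder branch does.
final class ConcatSketch {
    static byte[] concatBytes(byte[] val, byte[] oval) {
        // copyOf allocates the full-size buffer and copies val into its front.
        byte[] buf = Arrays.copyOf(val, val.length + oval.length);
        // arraycopy appends oval right after the bytes of val.
        System.arraycopy(oval, 0, buf, val.length, oval.length);
        return buf;
    }

    public static void main(String[] args) {
        byte[] a = "abc".getBytes(StandardCharsets.ISO_8859_1);
        byte[] b = "def".getBytes(StandardCharsets.ISO_8859_1);
        String joined = new String(concatBytes(a, b), StandardCharsets.ISO_8859_1);
        System.out.println(joined);                             // abcdef
        System.out.println(joined.equals("abc".concat("def"))); // true
    }
}
```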