Lately I've been thinking about some (possible) performance improvements for StringBuffer / StringBuilder:
Now, SBs are designed to work with char-arrays, because they are mutable, and if you append a value, you don't need to create extra objects. In most use-cases, an SB will be used to concatenate some (String)values. Most people don't set the initialCapacity, because that requires some extra code. The default cap is 16. Now, let's concat some strings:
sb.append("Hello ").append("Joris"/*user.getName()*/).append(", how are you doing ").append("Today?" /* or night */);
This will do the following:
check capacity for 6 more chars
copy "Hello " chars
count = 6
check capacity for 5 more chars
copy "Hello " chars
count = 11
check capacity for 18 more chars
expand capacity: curcap*2 or newsize -> (16+1)*2 > 5+9 --> newcap = 34
copy "Hello Joris"
copy " how are you doing " chars
count = 29
check capacity for 6 more chars
expand capacity: curcap*2 or newsize -> (16+1)*2 > 34+6 --> newcap = 70
copy "Hello Joris how are you doing "
copy "Today?" chars
count = 6
Ok.. So.. how could we prevent the expanding? If we'd have set the initialCapacity to 40, that wouldn't have happened. But as I said, nobody really does that, except for high performance stuff.
If you create an SB that will keep an linked list (or arraylist and probably a custom implementation) of strings and just add the new strings to the list. You wouldn't have to do all the copying. Only when you invoke toString() (or charAt, or something similar), you will have to create a single string. But by then, you already know how large the string will be!
That way, you prevent temporary array copies, temporary char-array creations and destructions, and therefore reducing GC.
Of course, I haven't tested all this in a real life situation :) If someone has a nice testcase for String concatenation performance, please send it to me. The tests I do are biased ofcourse ^^.
This ugly thing is something i use to reformat paragraphs in swing TextDocuments. It uses a custom stringbuffer like class that was created to remove the unneeded length checks. I replaced it with a Stringbuffer so that you can see it work. To test it take an large StyledDocument from a Swing TextComponent (a html or rtf document would do) and tell it to reparse it.
I guess this would test not only if its easier to use a linked list for those these things but if you'd be able to create regex correctly across char[] boundaries, and also how quickly can the gc get rid of it when it no longer needs it setLength(0)
package ui.parsers;
import java.awt.Color;
import java.util.AbstractSequentialList;
import java.util.Collection;
import java.util.LinkedList;
import java.util.regex.Pattern;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.Segment;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;
import javax.swing.text.html.HTML.Attribute;
//import ui.documents.SaneStringBuilder;
import ui.documents.SaneStyledDocument;
import util.Utils;
/**
* A class specially made to structurally standardize all kinds of
* documents that descend from the StyledDocument class in java.
* Carefull, it makes things to text to make it prose more readable
* but things like poetry are mangled by it. Forget about ASCII art
* using this.
* Removes spaces between words, and at the start of phrases
* Removes all tabs, and empty paragraphs.
* Removes the color black from characters because :
* 1) It is the default color if the color is unknown
* 2) It is the most common color and quite useless on most documents (it gets there by accident mostly)
* 3) The larger program replaces undefined colors. It would be neat to show uncommon colors and replace
* the most common one. I'm gessing that the most common one is black, and then I can show the uncommon ones
* and still allow the uncommon to show.
*
* It also fixes malformed paragraphs,
* in that the first letter like char is lowercase or pontuation.
* Things like:
* A strange city,
* but familiar nonetheless.
*
* This kind of formatting is useless for prose, and an habitual error in
* scanners.
*
* There is one situation where this doesn't fix the broken paragraph:
* But i was going to
* Paris.
*
* Since this kind of thing is usual in titles:
* A tale of two cities
* Chapter one
* It was ...
*
* would become:
* A tale of two cities Chapter one It was ...
*
* Any test of that condition would have this disavantage.
* Just fix the paragraphs in the book.
*
* @author i30817
*/
class Reparser implements IReparser {
//Midword char seperators, some are supposted to be invisible until a word break,
//but aren't in mapped in swing ['\u00AD\u00B7\u05F4\u2019\u2027]. Remove them.
//matches spaces tabs and an unmappable char at the end of a phrase
private static final String removalString = "[ \u00A0\u2027\t]*\n$";
//matches spaces tabs and an unmappable char
private static final String spaceMergingString = "[ \u00A0\u2027\t]+";
//matches things not a letter followed by a lowercase letter at the start of a phrase
private static final String paragraphMergingString = "^[^\\p{L}]*[\\p{L}&&[^\\p{Lu}]]";
private static final String paragraphLimitator = "\n";
private static final Pattern removalPattern = Pattern.compile(removalString);
private static final Pattern spaceMergingPattern = Pattern.compile(spaceMergingString);
private static final Pattern paragraphMergingPattern = Pattern.compile(paragraphMergingString);
private final Segment getterSegment = new Segment();
public Reparser(){
getterSegment.setPartialReturn(true);
}
/**
* Reparses the given document to uniformize
* the wordspacing, firstlineindent and paragraph
* spacing. These are set on other methods
* @param doc styleddocument to be standardized
* @return the standardized document. The value of any HTML.Attribute.HREF key will
* have a mapping to the Position given by an internal link (if any) in the new
* document like this: doc.getProperty(attributeSet.getAttribute(HTML.Attribute.HREF))
*/
public StyledDocument reParse(StyledDocument doc){
//its better to init this all at once to not stress the garbage collector.
//StyledDocument cache = new DefaultStyledDocument(new GapContent(doc.getLength()), StyleContext.getDefaultStyleContext());
StyledDocument cache = new SaneStyledDocument(doc.getLength());
try {
nonRecursiveReparse(doc.getDefaultRootElement(), doc, cache);
//inplaceNonRecursiveReparse(doc.getDefaultRootElement(), doc);
} catch (BadLocationException ex) {
ex.printStackTrace();
}
return cache;
}
private void nonRecursiveReparse(final Element root, final StyledDocument oldDoc, final StyledDocument newDoc) throws BadLocationException{
final int childs = root.getElementCount();
int paragraphChilds;
int start;
int end;
int m;
boolean firstLine = true, setLineSeparator = true;
Element currentParagraph, currentLine;
StringBuilder builder = new StringBuilder(250);
AbstractSequentialList currentList = new LinkedList();
for( int i = 0; i < childs; i++){
currentParagraph = root.getElement(i);
paragraphChilds = currentParagraph.getElementCount();
firstLine = true; //the first line of a paragrah if the tree is sane
if(setLineSeparator){//insert a line seperator if needed
newDoc.insertString(newDoc.getLength(),paragraphLimitator,null);
setLineSeparator = false;
}
//until the first non empty line
m = 0;
for( ; firstLine && m < paragraphChilds; m++){
currentLine = currentParagraph.getElement(m);
start = currentLine.getStartOffset();
end = currentLine.getEndOffset();
oldDoc.getText(start, end - start, getterSegment);
builder.append(getterSegment.array, getterSegment.offset, getterSegment.count);
//System.out.println("1 "+builder);
builder.replace(0, builder.length(), spaceMergingPattern.matcher(builder).replaceAll(" "));
//System.out.println("2 "+builder);
builder.replace(0, builder.length(), removalPattern.matcher(builder).replaceAll(""));
//System.out.println("3 "+builder);
//the space merging pattern coalesced the spaces.
//This removes the first one in this special case.
//(Start of paragraph)
if(builder.length() > 0 && builder.charAt(0) == ' ')
builder.deleteCharAt(0);
//reuse the variable to pass if this line is empty to the next
//(and thus the next line is still the first line)
firstLine = builder.length() == 0;
if(currentLine.getAttributes().isDefined(Attribute.NAME)){
currentList.addAll((Collection)currentLine.getAttributes().getAttribute(Attribute.NAME));
}
if(!firstLine){//we have something and
//the last paragraph didn't really end (by this one first lower-case letter)
//erase the \n and join both.
if(newDoc.getLength() != 0 && paragraphMergingPattern.matcher(builder).find()){
//Last char in the parsed text is \n
//remember this only happens once per paragraph
newDoc.remove(newDoc.getLength()-1,1);
//there is no space at the end of the parsed char so we have to insert it (it was removed in the removalPattern)
builder.insert(0, ' ');
}
//insert the string with the old atributes except Color if black
Utils.vetoSpecificCharacterAttribute(currentLine, StyleConstants.Foreground, Color.BLACK);
//Save the name attributes with a pointer to it in the properties (hyperlink pointer)
for(Object s : currentList)
newDoc.putProperty(s, Integer.valueOf(newDoc.getLength()));
currentList.clear();
newDoc.insertString(newDoc.getLength(),builder.toString(), currentLine.getAttributes());
//we now can set a line seperator (we have something inserted)
setLineSeparator = true;
builder.setLength(0);
}
}//END FIRST LINE FOR
//after first line processing
for( ; m < paragraphChilds; m++){
currentLine = currentParagraph.getElement(m);
start = currentLine.getStartOffset();
end = currentLine.getEndOffset();
oldDoc.getText(start, end - start, getterSegment);
builder.append(getterSegment.array, getterSegment.offset, getterSegment.count);
builder.replace(0, builder.length(), spaceMergingPattern.matcher(builder).replaceAll(" "));
builder.replace(0, builder.length(), removalPattern.matcher(builder).replaceAll(""));
if(currentLine.getAttributes().isDefined(Attribute.NAME)){
currentList.addAll((Collection)currentLine.getAttributes().getAttribute(Attribute.NAME));
}
if(builder.length() != 0){//now finally insert the processed line
//insert the string with the old atributes except Color if black
Utils.vetoSpecificCharacterAttribute(currentLine, StyleConstants.Foreground, Color.BLACK);
AttributeSet set = currentLine.getAttributes();
for(Object s : currentList)
newDoc.putProperty(s, Integer.valueOf(newDoc.getLength()));
currentList.clear();
newDoc.insertString(newDoc.getLength(),builder.toString(), set);
builder.setLength(0);
}
}//END LINES FOR
}//END PARAGRAPHS FOR
}
}