I always seem to underestimate how much readers of this blog enjoy a good coding challenge. A few days ago, one of my coworkers was tripped up by a line of code and I thought I’d share his confusion with people following me on Twitter, which turned out to be a fitting medium for this micro challenge:
Pop quiz: "abc.def".split(".").length should return...?
I really didn’t think much of it, but I got many more responses than I anticipated. A lot of them were incorrect (hey, if I throw out a challenge, there has to be a trick), but I still want to congratulate everyone for playing the game and answering without testing the code first. That’s the spirit!
A few people saw the trap, and interestingly, they pointed out that I must have made a mistake in the question instead of just giving the correct answer (“You must have meant split("\\.")”).
As hinted above, the trick here is that the parameter to java.lang.String#split is a regular expression, not a string. Since the dot matches every character in the given string, every character is treated as a separator and the only pieces left are empty strings. split() discards trailing empty strings, so it returns an empty string array and the answer is “0”.
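Here is a small sketch that reproduces the quiz and the two usual fixes. The class and method names (SplitQuiz, regexParts, literalParts) are made up for this illustration; only String#split and Pattern.quote come from the JDK.

```java
import java.util.regex.Pattern;

// Illustrative class, not from the original post.
class SplitQuiz {
    // Counts the pieces when the argument is interpreted as a regular
    // expression, which is what String#split always does.
    static int regexParts(String input, String regex) {
        return input.split(regex).length;
    }

    // Counts the pieces when the delimiter is meant literally.
    // Pattern.quote escapes any regex metacharacters in the delimiter.
    static int literalParts(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter)).length;
    }

    public static void main(String[] args) {
        System.out.println(regexParts("abc.def", "."));      // 0: every character is a separator
        System.out.println(regexParts("abc.def", "\\."));    // 2: escaped dot
        System.out.println(literalParts("abc.def", "."));    // 2: quoted dot
        // A negative limit keeps the trailing empty strings that
        // the default call throws away.
        System.out.println("abc.def".split(".", -1).length); // 8
    }
}
```

The last line shows where the “0” comes from: the seven separators produce eight empty strings, and the default limit of zero removes all of them because they are trailing.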
This didn’t stop a few people from insisting that the answer is “2”, which made me realize that the code snippet above happens to be valid Ruby code, a pretty unlikely coincidence. So to all of you who answered 2 because you thought this was Ruby, you were right. To the others, you are still wrong, sorry 🙂
The bottom line is that this method is terribly designed. First of all, the string confusion is pretty common (I’ve been tricked by it more than a few times myself). A much better design would be to have two overloaded methods, one that takes a String (a real one) and one that takes a regular expression (a class that, unfortunately, doesn’t exist in Java, so what you really want is a java.util.regex.Pattern).
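To make the idea concrete, here is a rough sketch of what such a pair of overloads could look like. Splitter is a hypothetical helper class of my own, not something that exists in the JDK; the only real APIs used are Pattern.compile with the Pattern.LITERAL flag and Pattern#split.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK.
final class Splitter {
    private Splitter() {}

    // The delimiter is taken literally: Pattern.LITERAL disables all
    // regex metacharacters, so "." only matches an actual dot.
    static String[] split(String input, String delimiter) {
        return Pattern.compile(delimiter, Pattern.LITERAL).split(input);
    }

    // The caller explicitly asks for regular expression semantics
    // by passing a Pattern.
    static String[] split(String input, Pattern pattern) {
        return pattern.split(input);
    }
}
```

With this shape, Splitter.split("abc.def", ".") returns two elements, and anyone who wants regex behavior has to say so by passing a Pattern. Note that this sketch still compiles a Pattern internally for the literal case; a real implementation would probably scan for the delimiter directly.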
API clarity is not the only benefit of this approach: there is also the performance aspect.
With the current signature, each call to split() causes the regular expression to be recompiled, as the source sadly confirms:
public String[] split(String regex, int limit) {
    return Pattern.compile(regex).split(this, limit);
}
Since it’s not uncommon to have such parsing code in a loop, this can become a significant performance bottleneck. On the other hand, if we had an overloaded version of split() that accepts a Pattern, it would be possible to precompile this pattern outside the loop.
Interestingly, that’s how Ruby implements split():
If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.
If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters.
But there is a catch: since Ruby is dynamically typed (which means it doesn’t support overloading either), the determination of the class of the parameter has to be done at runtime, and even though this particular method is implemented in C, there is still an unavoidable performance toll to be paid, which is unfortunate for such a core method.
The conclusion of this little drill is that if you are going to split strings repeatedly, you are better off compiling the regular expression yourself and using Pattern#split instead of String#split.
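A minimal sketch of that advice, assuming the strings to split arrive as a List. The class name PrecompiledSplit and the method splitAll are invented for the example; the point is only that Pattern.compile runs once, outside the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative class, not from the original post.
class PrecompiledSplit {
    // Compiled once when the class is loaded, then reused for every call.
    private static final Pattern DOT = Pattern.compile("\\.");

    // Splits every line on a literal dot without recompiling the pattern.
    static List<String[]> splitAll(List<String> lines) {
        List<String[]> result = new ArrayList<>();
        for (String line : lines) {
            result.add(DOT.split(line));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] parts : splitAll(Arrays.asList("abc.def", "a.b.c"))) {
            System.out.println(parts.length); // 2, then 3
        }
    }
}
```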
#1 by Joshua Foster on July 4, 2009 - 4:19 pm
Interesting, nice tip.
#2 by Evgeny on July 4, 2009 - 7:36 pm
Yep, Ruby sure works a bit differently.
irb(main):001:0> "abc.def".split(".").length
=> 2
irb(main):002:0> "abc.def".split(/\./).length
=> 2
irb(main):003:0> "abc.def".split(/./).length
=> 0
#3 by Alex Blewitt on July 5, 2009 - 11:10 am
It’s also valid Python code, wot gives 2
>>> len("abc.def".split("."))
2
#4 by Alex Blewitt on July 5, 2009 - 3:35 pm
Ok, so I cheated a little and moved the .length to .len. However, it’s also a valid JavaScript expression, at least according to WebKit’s development console:
> "abc.def".split(".").length
2
#5 by Kevin on July 6, 2009 - 2:52 pm
Read the Java Docs for Pattern before you write stuff like this. There are no performance problems if you actually know how to use Pattern:
Pattern matchPeriod = Pattern.compile("\\.");
for (String s : getSomeStrings()) {
    String[] list = matchPeriod.split(s);
    // do whatever
}
java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html#split(java.lang.CharSequence)
#6 by Cedric on July 6, 2009 - 3:18 pm
Kevin, this is *exactly* what I recommend at the end of the post. What part of “Avoid using String#split in a loop” do you disagree with?
#7 by unacoder on July 6, 2009 - 7:19 pm
i would have guessed 8, but that’s javascript. i imagine perl might behave the same way. i could never bring myself to like the java implementation.
#8 by Charles Oliver Nutter on July 9, 2009 - 10:49 am
A better implementation would be to have literal regular expressions in Java, so that there’s no confusion about whether a string is going to be used as a string or a regexp. 🙂
#9 by John Barnard on July 10, 2009 - 10:37 pm
Wow, I must really be doing rocket science computation all the time if I can’t afford a couple of microseconds for the sake of simplicity…
#10 by ashbyp on July 13, 2009 - 3:05 pm
Interestingly,
// apache commons
StringUtils.split(str, pattern);
is about 3 times faster than the pre-compiled java version
Pattern matchPeriod = Pattern.compile("\\.");
for (String s : getSomeStrings()) {
    String[] list = matchPeriod.split(s);
    // do whatever
}
Of course, you need to be splitting a sh1t load of strings for this to be important, but you might be…