@awwsmm
Last active September 30, 2018 13:19
Notes on escape sequences in Java Strings

To check if a string contains any escape characters:

jshell> String mytest1 = "this is a string with no escape sequences"
mytest1 ==> "this is a string with no escape sequences"

jshell> String mytest2 = "this string has a \n line break"
mytest2 ==> "this string has a \n line break"

jshell> Collections.disjoint(Arrays.asList(mytest1.split("")), Arrays.asList("\n"))
$21 ==> true

jshell> Collections.disjoint(Arrays.asList(mytest2.split("")), Arrays.asList("\n"))
$22 ==> false

...obviously, you would have to add every escape character you're interested in to the second Arrays.asList(), which gets unwieldy quickly. And this only tells you whether the String contains any of them; it gives no other information, such as where they are.
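
A rough sketch of one way around both limitations, using only the standard library: keep the characters of interest in a Set and scan the String once. The class name EscapeCharCheck, the SPECIAL set and both method names are made up for this example, and the set only holds the characters I happened to pick, so extend it as needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class EscapeCharCheck {
    // Illustrative set of characters to look for (an assumption, not exhaustive).
    static final Set<Character> SPECIAL =
            Set.of('\b', '\t', '\n', '\f', '\r', '"', '\'', '\\');

    // True if the String contains at least one character from SPECIAL.
    static boolean containsSpecial(String s) {
        return s.chars().anyMatch(c -> SPECIAL.contains((char) c));
    }

    // Index of every character from SPECIAL, in order of appearance.
    static List<Integer> specialPositions(String s) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (SPECIAL.contains(s.charAt(i))) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(containsSpecial("this is a string with no escape sequences")); // false
        System.out.println(specialPositions("this string has a \n line break"));          // [18]
    }
}
```

Set.of needs Java 9 or later, which jshell already implies.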

Regex for finding escape sequences:

(\\(?:b|t|n|f|r|\"|'|\\)|\\(?:[0-3][0-7]{2}|[0-7]{1,2})|\\u(?:[0-9a-fA-F]{4}))

Expanded:

    (\\                       # get the preceding backslash (for each section)
      (?:b|t|n|f|r|\"|'|\\)   # match common sequences like \n and \t

      |\\                     # OR (get the preceding backslash and)...
      # match variable-width octal escape sequences like \02, \13, or \377
      # (octal digits are 0-7 only; three-digit sequences must start with 0-3)
      (?:[0-3][0-7]{2}|[0-7]{1,2})

      |\\                     # OR (get the preceding backslash and)...
      u(?:[0-9a-fA-F]{4})     # match fixed-width Unicode sequences like \u0242 or \uFFAD
    )

This captures every Java escape sequence listed in 3, including its leading backslash. There are three alternatives. The first matches a backslash followed by any of {b, t, n, f, r, ", ', \}. These are the most common escape sequences. The second matches a backslash followed by an octal escape (\0 - \377, variable width). The third matches a backslash followed by a Unicode escape (\u0000 - \uFFFF, fixed width).

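This can't be applied to a String value with java.util.regex.Pattern and Matcher, though, because the regex assumes that "\n" is two characters (a literal "\" and a literal "n"), when in a compiled String '\n' is a single character: the newline character. It only matches text in which the backslashes are still literal characters, such as source code read from a file.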
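
To make the difference concrete, here's a small sketch that runs the regex over two Strings: one where the backslashes are literal characters (written as \\ in Java source) and one where the compiler has already interpreted the escapes. The class name EscapeRegexDemo and the method findEscapes are invented for this example, and the octal branch is written with digits 0-7 only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeRegexDemo {
    // Every regex backslash is doubled again for the Java String literal.
    static final Pattern ESCAPE = Pattern.compile(
            "(\\\\(?:b|t|n|f|r|\"|'|\\\\)"          // simple escapes
          + "|\\\\(?:[0-3][0-7]{2}|[0-7]{1,2})"    // octal escapes, 1 to 3 digits
          + "|\\\\u(?:[0-9a-fA-F]{4}))");          // Unicode escapes, 4 hex digits

    // Returns each escape sequence found in the text, in order.
    static List<String> findEscapes(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ESCAPE.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Backslashes are literal characters here, as in source code read from a file.
        String sourceText = "tab \\t, octal \\13, unicode \\u0242";
        // Here the compiler has already replaced each escape with a single character.
        String interpreted = "tab \t, newline \n";

        System.out.println(findEscapes(sourceText).size());   // 3
        System.out.println(findEscapes(interpreted).size());  // 0
    }
}
```

The second call finds nothing because the interpreted String contains no backslash characters at all.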

To do this in Java, we need StringEscapeUtils#escapeJava from the Apache Commons Lang library:

jshell> StringEscapeUtils.escapeJava("Newline \n here \u0344 and unicode \f\n\r\t\"\0\13 and more")
$136 ==> "Newline \\n here \\u0344 and unicode \\f\\n\\r\\t\\\"\\u0000\\u000B and more"

...which turns each special character back into its textual escape sequence and also swaps the variable-width octal sequences for fixed-width Unicode sequences. (jshell prints the result as a String literal, so every literal backslash shows up as \\.) This is useful for finding the positions of these sequences in the original String: as displayed above, every Apache-fied escape sequence consists of two backslash characters followed by either (a) a 'u' and exactly four hexadecimal digits, or (b) one of {b, t, n, f, r, ", \}. Since I don't need to find the positions of these characters in the original String, I'm stopping here.
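
If pulling in Commons Lang isn't an option, the behaviour in the example above can be roughly approximated by hand. This is only a sketch: ManualEscaper and escape are hypothetical names, this is not how the library is implemented, and it works per UTF-16 code unit without any of the library's extra handling. It maps the common characters to their two-character escapes and anything below 32 or above 0x7f to a four-digit Unicode escape.

```java
class ManualEscaper {
    // Hypothetical helper, NOT the Commons Lang implementation.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\b': out.append("\\b"); break;
                case '\t': out.append("\\t"); break;
                case '\n': out.append("\\n"); break;
                case '\f': out.append("\\f"); break;
                case '\r': out.append("\\r"); break;
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                default:
                    if (c < 32 || c > 0x7f) {
                        // Control and non-ASCII characters become a
                        // backslash, a 'u', and four uppercase hex digits.
                        out.append(String.format("\\u%04X", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same input as the jshell example; println shows single backslashes.
        System.out.println(
                escape("Newline \n here \u0344 and unicode \f\n\r\t\"\0\13 and more"));
    }
}
```

Because the result contains only literal backslashes, the regex from earlier can then be run over it with Pattern and Matcher.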

References:

1 https://stackoverflow.com/questions/5235401/split-string-into-array-of-character-strings

2 https://stackoverflow.com/questions/8708542/something-like-contains-any-for-java-set

3 https://stackoverflow.com/questions/1367322/what-are-all-the-escape-characters

4 https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringEscapeUtils.html#escapeJava-java.lang.String-
