Another exciting package added by Java 2, version 1.4 is java.util.regex, which supports regular expression processing. As the term is used here, a regular expression is a string of characters that describes a character sequence. This general description, called a pattern, can then be used to find matches in other character sequences. Regular expressions can specify wildcard characters, sets of characters, and various quantifiers. Thus, you can specify a regular expression that represents a general form that can match several different specific character sequences.
There are two classes that support regular expression processing: Pattern and Matcher. These classes work together. Use Pattern to define a regular expression. Match the pattern against another sequence using Matcher.
Pattern
The Pattern class defines no constructors. Instead, a pattern is created by calling the compile( ) factory method. One of its forms is shown here:
static Pattern compile(String pattern)
Here, pattern is the regular expression that you want to use. The compile( ) method transforms the string in pattern into a pattern that can be used for pattern matching by the Matcher class. It returns a Pattern object that contains the pattern. Once you have created a Pattern object, you will use it to create a Matcher. This is done by calling the matcher( ) factory method defined by Pattern. It is shown here:
Matcher matcher(CharSequence str)
Here, str is the character sequence that the pattern will be matched against. This is called the input sequence. CharSequence is an interface, added by Java 2, version 1.4, that defines a read-only sequence of characters. It is implemented by the String class, among others. Thus, you can pass a string to matcher( ).
Matcher
The Matcher class has no constructors. Instead, you create a Matcher by calling the matcher( ) factory method defined by Pattern, as just explained. Once you have created a Matcher, you will use its methods to perform various pattern matching operations. The simplest pattern matching method is matches( ), which simply determines whether the character sequence matches the pattern. It is shown here:
boolean matches( )
It returns true if the sequence and the pattern match, and false otherwise. Understand that the entire sequence must match the pattern, not just a subsequence of it. To determine if a subsequence of the input sequence matches the pattern, use find( ). One version is shown here:
boolean find( )
It returns true if there is a matching subsequence and false otherwise. This method can be called repeatedly, allowing it to find all matching subsequences. Each call to find( ) begins where the previous one left off. You can obtain a string containing the last matching sequence by calling group( ). One of its forms is shown here:
String group( )
The matching string is returned. If no match exists, then an IllegalStateException is thrown. You can obtain the index within the input sequence of the current match by calling start( ). The index one past the end of the current match is obtained by calling end( ). These methods are shown here:
int start( )
int end( )
You can replace all occurrences of a matching sequence with another sequence by calling replaceAll( ), shown here:
String replaceAll(String newStr)
Here, newStr specifies the new character sequence that will replace the ones that match the pattern. The updated input sequence is returned as a string.
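The following sketch pulls several of these methods together. It is only an illustration: the class name MatcherStateDemo and the helpers firstMatch( ) and groupThrowsBeforeFind( ) are hypothetical names chosen for this example. It passes a StringBuffer rather than a String to show that any CharSequence is accepted.

```java
// A sketch of find(), group(), start(), and end() working together.
import java.util.regex.*;

class MatcherStateDemo {
  // Hypothetical helper: describes the first match as text[start,end),
  // or returns "none" when the pattern is not found.
  static String firstMatch(String regex, CharSequence input) {
    Matcher mat = Pattern.compile(regex).matcher(input);
    if (!mat.find()) return "none";
    return mat.group() + "[" + mat.start() + "," + mat.end() + ")";
  }

  // Hypothetical helper: reports whether group() throws when it is
  // called before any match has been attempted.
  static boolean groupThrowsBeforeFind(String regex, CharSequence input) {
    Matcher mat = Pattern.compile(regex).matcher(input);
    try {
      mat.group();
      return false;
    } catch (IllegalStateException exc) {
      return true;
    }
  }

  public static void main(String args[]) {
    StringBuffer buf = new StringBuffer("Java 2");
    System.out.println(firstMatch("Java", buf));            // Java[0,4)
    System.out.println(groupThrowsBeforeFind("Java", buf)); // true
  }
}
```

Notice that end( ) returns 4, which is one past the index of the final a in the match.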
Regular Expression Syntax
Before demonstrating Pattern and Matcher, it is necessary to explain how to construct a regular expression. The syntax and rules that define a regular expression are similar to those used by Perl 5. Although no rule is complicated by itself, there are a large number of them, and a complete discussion is beyond the scope of this chapter. However, a few of the more commonly used constructs are described here.
In general, a regular expression is comprised of normal characters, character classes (sets of characters), wildcard characters, and quantifiers. A normal character is matched as-is. Thus, if a pattern consists of “xy”, then the only input sequence that will match it is “xy”. Characters such as newline and tab are specified using the standard escape sequences, which begin with a \. For example, a newline is specified by \n. In the language of regular expressions, a normal character is also called a literal.
A character class is a set of characters. A character class is specified by putting the characters in the class between brackets. For example, the class [wxyz] matches w, x, y, or z. To specify an inverted set, precede the characters with a ^. For example, [^wxyz] matches any character except w, x, y, or z. You can specify a range of characters using a hyphen. For example, to specify a character class that will match the digits 1 through 9, use [1-9].
The wildcard character is the . (dot), and it matches any character (by default, any character other than a line terminator). Thus, a pattern that consists of “.” will match these (and other) input sequences: “A”, “a”, “x”, and so on. A quantifier determines how many times an expression is matched. The quantifiers are shown here:
+ - Match one or more.
* - Match zero or more.
? - Match zero or one.
For example, the pattern “x+” will match “x”, “xx”, and “xxx”, among others.
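The rules just described can be checked with a short sketch. The class name SyntaxDemo and the helper whole( ) are hypothetical names used only for this illustration. The helper relies on matches( ), so each result reflects whether the entire input fits the pattern.

```java
// A sketch that exercises character classes, the wildcard, and quantifiers.
import java.util.regex.*;

class SyntaxDemo {
  // Hypothetical helper: true if the entire input matches the pattern.
  static boolean whole(String regex, String input) {
    return Pattern.compile(regex).matcher(input).matches();
  }

  public static void main(String args[]) {
    System.out.println(whole("[wxyz]", "x"));  // true: x is in the class
    System.out.println(whole("[^wxyz]", "x")); // false: x is excluded
    System.out.println(whole("[1-9]", "0"));   // false: 0 is outside the range
    System.out.println(whole(".", "A"));       // true: dot matches any one character
    System.out.println(whole("x+", "xxx"));    // true: one or more
    System.out.println(whole("x+", ""));       // false: + needs at least one x
    System.out.println(whole("x*", ""));       // true: zero or more
    System.out.println(whole("x?", "xx"));     // false: ? allows at most one
  }
}
```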
Demonstrating Pattern Matching
The best way to understand how regular expression pattern matching operates is to work through some examples. The first, shown here, looks for a match with a literal pattern.
// A simple pattern matching demo.
import java.util.regex.*;
class RegExpr {
  public static void main(String args[]) {
    Pattern pat;
    Matcher mat;
    boolean found;
    pat = Pattern.compile("Java");
    mat = pat.matcher("Java");
    found = mat.matches(); // check for a match
    System.out.println("Testing Java against Java.");
    if(found) System.out.println("Matches");
    else System.out.println("No Match");
    System.out.println();
    System.out.println("Testing Java against Java 2.");
    mat = pat.matcher("Java 2"); // create a new matcher
    found = mat.matches(); // check for a match
    if(found) System.out.println("Matches");
    else System.out.println("No Match");
  }
}
The output from the program is shown here:
Testing Java against Java.
Matches
Testing Java against Java 2.
No Match
Let’s look closely at this program. The program begins by creating the pattern that contains the sequence “Java”. Next, a Matcher is created for that pattern that has the input sequence “Java”. Then, the matches( ) method is called to determine if the input sequence matches the pattern. Because the sequence and the pattern are the same, matches( ) returns true. Next, a new Matcher is created with the input sequence “Java 2” and matches( ) is called again. In this case, the pattern and the input sequence differ, and no match is found. Remember, the matches( ) method returns true only when the input sequence precisely matches the pattern. It will not return true just because a subsequence matches.
You can use find( ) to determine if the input sequence contains a subsequence that matches the pattern. Consider the following program.
// Use find() to find a subsequence.
import java.util.regex.*;
class RegExpr2 {
  public static void main(String args[]) {
    Pattern pat = Pattern.compile("Java");
    Matcher mat = pat.matcher("Java 2");
    System.out.println("Looking for Java in Java 2.");
    if(mat.find()) System.out.println("subsequence found");
    else System.out.println("No Match");
  }
}
The output is shown here:
Looking for Java in Java 2.
subsequence found
In this case, find( ) finds the subsequence “Java”. The find( ) method can be used to search the input sequence for repeated occurrences of the pattern because each call to find( ) picks up where the previous one left off. For example, the following program finds two occurrences of the pattern “test”.
// Use find() to find multiple subsequences.
import java.util.regex.*;
class RegExpr3 {
  public static void main(String args[]) {
    Pattern pat = Pattern.compile("test");
    Matcher mat = pat.matcher("test 1 2 3 test");
    while(mat.find()) {
      System.out.println("test found at index " +
                         mat.start());
    }
  }
}
The output is shown here:
test found at index 0
test found at index 11
As the output shows, two matches were found. The program uses the start( ) method to obtain the index of each match.
Using Wildcards and Quantifiers
Although the preceding programs show the general technique for using Pattern and Matcher, they don’t show their power. The real benefit of regular expression processing is not seen until wildcards and quantifiers are used. To begin, consider the following example that uses the + quantifier to match any arbitrarily long sequence of Ws.
// Use a quantifier.
import java.util.regex.*;
class RegExpr4 {
  public static void main(String args[]) {
    Pattern pat = Pattern.compile("W+");
    Matcher mat = pat.matcher("W WW WWW");
    while(mat.find())
      System.out.println("Match: " + mat.group());
  }
}
The output from the program is shown here:
Match: W
Match: WW
Match: WWW
As the output shows, the regular expression pattern “W+” matches any arbitrarily long sequence of Ws. The next program uses a wildcard to create a pattern that will match any sequence that begins with e and ends with d. To do this, it uses the dot wildcard character along with the + quantifier.
// Use wildcard and quantifier.
import java.util.regex.*;
class RegExpr5 {
  public static void main(String args[]) {
    Pattern pat = Pattern.compile("e.+d");
    Matcher mat = pat.matcher("extend cup end table");
    while(mat.find())
      System.out.println("Match: " + mat.group());
  }
}
You might be surprised by the output produced by the program, which is shown here:
Match: extend cup end
Only one match is found, and it is the longest sequence that begins with e and ends with d. You might have expected two matches: extend and end. The longer sequence is found because, by default, quantifiers are greedy: the .+ consumes as many characters as it can while still allowing the rest of the pattern to match. You can specify reluctant behavior by adding ? after the quantifier, as shown in this version of the program. It causes the shortest matching sequence to be obtained.
// Use the ? quantifier.
import java.util.regex.*;
class RegExpr6 {
  public static void main(String args[]) {
    // Use reluctant matching behavior.
    Pattern pat = Pattern.compile("e.+?d");
    Matcher mat = pat.matcher("extend cup end table");
    while(mat.find())
      System.out.println("Match: " + mat.group());
  }
}
The output from the program is shown here:
Match: extend
Match: end
As the output shows, the pattern “e.+?d” will match the shortest sequence that begins with e and ends with d. Thus, two matches are found.
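To compare the two behaviors side by side, here is a minimal sketch. The class name GreedyVsReluctant and the helper findAll( ) are hypothetical names for this illustration, and the code assumes Java 5 or later because it uses a generic List. It also applies the reluctant form to the earlier W+ pattern, where each match shrinks to a single W.

```java
// A sketch comparing greedy and reluctant quantifiers.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

class GreedyVsReluctant {
  // Hypothetical helper: collects every subsequence that find() locates.
  static List<String> findAll(String regex, String input) {
    List<String> matches = new ArrayList<String>();
    Matcher mat = Pattern.compile(regex).matcher(input);
    while (mat.find())
      matches.add(mat.group());
    return matches;
  }

  public static void main(String args[]) {
    String input = "extend cup end table";
    System.out.println(findAll("e.+d", input));  // [extend cup end]
    System.out.println(findAll("e.+?d", input)); // [extend, end]
    System.out.println(findAll("W+", "WWW"));    // [WWW]
    System.out.println(findAll("W+?", "WWW"));   // [W, W, W]
  }
}
```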
Working with Classes of Characters
Sometimes you will want to match any sequence that contains one or more characters, in any order, that are part of a set of characters. For example, to match whole words, you want to match any sequence of the letters of the alphabet. One of the easiest ways to do this is to use a character class, which defines a set of characters. Recall that a character class is created by putting the characters you want to match between brackets. For example, to match the lowercase characters a through z, use [a-z]. The following program demonstrates this technique.
// Use a character class.
import java.util.regex.*;
class RegExpr7 {
  public static void main(String args[]) {
    // Match lowercase words.
    Pattern pat = Pattern.compile("[a-z]+");
    Matcher mat = pat.matcher("this is a test.");
    while(mat.find())
      System.out.println("Match: " + mat.group());
  }
}
The output is shown here:
Match: this
Match: is
Match: a
Match: test
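The same approach extends to other classes. The sketch below tries a class with two ranges, a digit range, and an inverted class against one input. The class name CharClassDemo, the helper findAll( ), and the input string are assumptions made for this illustration. The code assumes Java 5 or later for the generic List.

```java
// A sketch of several character classes applied to the same input.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

class CharClassDemo {
  // Hypothetical helper: collects every subsequence that find() locates.
  static List<String> findAll(String regex, String input) {
    List<String> matches = new ArrayList<String>();
    Matcher mat = Pattern.compile(regex).matcher(input);
    while (mat.find())
      matches.add(mat.group());
    return matches;
  }

  public static void main(String args[]) {
    String input = "Java 2 version 1.4";
    // Two ranges in one class: uppercase and lowercase letters.
    System.out.println(findAll("[a-zA-Z]+", input)); // [Java, version]
    // Runs of digits; the period splits 1.4 into two matches.
    System.out.println(findAll("[0-9]+", input));    // [2, 1, 4]
    // Inverted class: runs of anything that is not a space.
    System.out.println(findAll("[^ ]+", input));     // [Java, 2, version, 1.4]
  }
}
```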
Using replaceAll( )
The replaceAll( ) method supplied by Matcher lets you perform powerful search and replace operations that use regular expressions. For example, the following program replaces all occurrences of sequences that begin with “Jon” with “Eric”.
// Use replaceAll().
import java.util.regex.*;
class RegExpr8 {
  public static void main(String args[]) {
    String str = "Jon Jonathan Frank Ken Todd";
    Pattern pat = Pattern.compile("Jon.*? ");
    Matcher mat = pat.matcher(str);
    System.out.println("Original sequence: " + str);
    str = mat.replaceAll("Eric ");
    System.out.println("Modified sequence: " + str);
  }
}
The output is shown here:
Original sequence: Jon Jonathan Frank Ken Todd
Modified sequence: Eric Eric Frank Ken Todd
Because the regular expression “Jon.*? ” matches any string that begins with Jon followed by zero or more characters, ending in a space, it can be used to match and replace both Jon and Jonathan with the name Eric. Such a substitution would be far more difficult without pattern matching capabilities.
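One consequence of ending the pattern with a space is that a name at the very end of the input is not replaced, because no space follows it. The sketch below contrasts that pattern with an alternative that uses a character class instead. This alternative is not from the program above; it is one possible variation, and the class name ReplaceDemo, the helper rename( ), and the input string are hypothetical.

```java
// A sketch contrasting two patterns passed to replaceAll().
import java.util.regex.*;

class ReplaceDemo {
  // Hypothetical helper: replaces every match of regex in input.
  static String rename(String regex, String input, String replacement) {
    Matcher mat = Pattern.compile(regex).matcher(input);
    return mat.replaceAll(replacement);
  }

  public static void main(String args[]) {
    String str = "Jon Jonathan Frank Jon";
    // The trailing space is part of the match, so the final Jon is skipped.
    System.out.println(rename("Jon.*? ", str, "Eric "));
    // prints: Eric Eric Frank Jon

    // Jon followed by zero or more lowercase letters; no space required.
    System.out.println(rename("Jon[a-z]*", str, "Eric"));
    // prints: Eric Eric Frank Eric
  }
}
```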
Using split( )
You can reduce an input sequence into its individual tokens by using the split( ) method defined by Pattern. The split( ) method is shown here:
String[ ] split(CharSequence str)
It processes the input sequence passed in str, reducing it into tokens based on the delimiters specified by the pattern. For example, the following program finds tokens that are separated by spaces, commas, periods, and exclamation points.
// Use split().
import java.util.regex.*;
class RegExpr9 {
  public static void main(String args[]) {
    // Split on spaces, commas, periods, and exclamation points.
    Pattern pat = Pattern.compile("[ ,.!]");
    String strs[] = pat.split("one two,alpha9 12!done.");
    for(int i=0; i < strs.length; i++)
      System.out.println("Next token: " + strs[i]);
  }
}
The output is shown here:
Next token: one
Next token: two
Next token: alpha9
Next token: 12
Next token: done
As the output shows, the input sequence is reduced to its individual tokens. Notice that the delimiters are not included.
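The pattern in that program treats each delimiter character separately, so two delimiters in a row (such as a comma followed by a space) produce an empty token between them. Adding the + quantifier makes a run of delimiters count as one. The following sketch illustrates the difference. The class name SplitDemo, the helper tokens( ), and the input string are hypothetical.

```java
// A sketch showing how a quantifier changes what split() returns.
import java.util.regex.*;

class SplitDemo {
  // Hypothetical helper: splits input around matches of regex.
  static String[] tokens(String regex, String input) {
    return Pattern.compile(regex).split(input);
  }

  public static void main(String args[]) {
    String input = "one, two";

    // Each delimiter is matched on its own: an empty token appears
    // between the comma and the space.
    String single[] = tokens("[ ,.!]", input);
    System.out.println(single.length); // 3

    // A run of delimiters is matched as a unit.
    String runs[] = tokens("[ ,.!]+", input);
    System.out.println(runs.length);   // 2
  }
}
```

Trailing empty tokens are discarded by this form of split( ), which is why the final period in the earlier program does not produce an extra token.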
Two Pattern-Matching Options
Although the pattern-matching techniques described in the foregoing offer the greatest flexibility and power, there are two alternatives which you might find useful in some circumstances. If you only need to perform a one-time pattern match, you can use the matches( ) method defined by Pattern. It is shown here:
static boolean matches(String pattern, CharSequence str)
It returns true if pattern matches str and false otherwise. This method automatically compiles pattern and then looks for a match. If you will be using the same pattern repeatedly, then using matches( ) is less efficient than compiling the pattern and using the pattern-matching methods defined by Matcher, as described previously. You can also perform a pattern match by using the matches( ) method implemented by String. It is shown here:
boolean matches(String pattern)
If the invoking string matches the regular expression in pattern, then matches( ) returns true. Otherwise, it returns false.
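Both alternatives can be sketched briefly. The class name OneTimeMatchDemo and the helpers viaPattern( ) and viaString( ) are hypothetical wrappers added only to label which method is being called. As with Matcher's matches( ), the entire input must match.

```java
// A sketch of the two one-time matching alternatives.
import java.util.regex.*;

class OneTimeMatchDemo {
  // Hypothetical wrapper around the static Pattern.matches().
  static boolean viaPattern(String regex, CharSequence input) {
    return Pattern.matches(regex, input);
  }

  // Hypothetical wrapper around String's matches().
  static boolean viaString(String regex, String input) {
    return input.matches(regex);
  }

  public static void main(String args[]) {
    System.out.println(viaPattern("Java", "Java"));     // true
    System.out.println(viaPattern("Java", "Java 2"));   // false: whole input must match
    System.out.println(viaString("Java.*", "Java 2"));  // true: .* covers the rest
  }
}
```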
Exploring Regular Expressions
The overview of regular expressions presented in this section only hints at their power. Since text parsing, manipulation, and tokenization are a large part of programming, you will likely find Java’s regular expression subsystem a powerful tool that you can use to your advantage. It is, therefore, wise to explore the capabilities of regular expressions. Experiment with several different types of patterns and input sequences. Once you understand how regular expression pattern matching works, you will find it useful in many of your programming endeavors.