I/O Performance
This article discusses and illustrates a variety of techniques for improving Java I/O performance. Most of the techniques center around tuning disk file I/O, but some are applicable to network I/O and window output as well. The first set of techniques presented below cover low-level I/O issues, and then higher-level issues such as compression, formatting, and serialization are discussed. Note, however, the discussion does not cover application design issues, such as choice of search algorithms and data structures, nor does it discuss system-level issues such as file caching.
When discussing Java I/O, it's worth noting that the Java programming language assumes two distinct types of disk file organization. One is based on streams of bytes, the other on character sequences. In the Java language a character is represented using two bytes, not one byte as in other common languages such as C. Because of this, some translation is required to read characters from a file. This distinction is important in some contexts, as several of the examples will illustrate.
The topics covered below are:
- Low-Level I/O Issues
  - Basic Rules for Speeding Up I/O
  - Buffering
  - Reading/Writing Text Files
  - Formatting Costs
  - Random Access
- High-Level I/O Issues
  - Compression
  - Caching
  - Tokenization
  - Serialization
  - Obtaining Information About Files
  - Further Information
Basic Rules for Speeding Up I/O
As a means of starting the discussion, here are some basic rules on how to speed up I/O:
- Avoid accessing the disk.
- Avoid accessing the underlying operating system.
- Avoid method calls.
- Avoid processing bytes and characters individually.
These rules obviously cannot be applied in a "blanket" way, because if that were the case, no I/O would ever get done! But to see how they can be applied, consider the following three-part example that counts the number of newline bytes ('\n') in a file.
Approach 1: Read Method
The first approach simply uses the read method on a FileInputStream:
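A minimal sketch of what such a program might look like follows; the input file name test.txt is a placeholder, and the class name intro1 is taken from the timing table below.

    import java.io.FileInputStream;
    import java.io.IOException;

    public class intro1 {
        public static void main(String[] args) throws IOException {
            FileInputStream fis = new FileInputStream("test.txt");
            int cnt = 0;
            int b;
            // every call to read() goes down to a native method
            while ((b = fis.read()) != -1) {
                if (b == '\n') {
                    cnt++;
                }
            }
            fis.close();
            System.out.println(cnt);
        }
    }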
However, this approach triggers a lot of calls to the underlying runtime system, that is, FileInputStream.read, a native method that returns the next byte of the file.
Approach 2: Using a Large Buffer
The second approach avoids the above problem by using a large buffer:
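A sketch of this version, with the same placeholder file name and the class name intro2 from the timing table below:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class intro2 {
        public static void main(String[] args) throws IOException {
            BufferedInputStream bis =
                    new BufferedInputStream(new FileInputStream("test.txt"));
            int cnt = 0;
            int b;
            // read() now usually returns the next byte from an in-memory buffer
            while ((b = bis.read()) != -1) {
                if (b == '\n') {
                    cnt++;
                }
            }
            bis.close();
            System.out.println(cnt);
        }
    }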
BufferedInputStream.read takes the next byte from the input buffer, and only rarely accesses the underlying system.
Approach 3: Direct Buffering
The third approach avoids BufferedInputStream and does buffering directly, thereby eliminating the read method calls:
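A sketch of the directly buffered version; the 2048-byte buffer size and file name are illustrative, and the class name intro3 comes from the timing table below:

    import java.io.FileInputStream;
    import java.io.IOException;

    public class intro3 {
        public static void main(String[] args) throws IOException {
            FileInputStream fis = new FileInputStream("test.txt");
            byte[] buf = new byte[2048];
            int cnt = 0;
            int n;
            // fill the buffer ourselves and scan it, avoiding per-byte method calls
            while ((n = fis.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        cnt++;
                    }
                }
            }
            fis.close();
            System.out.println(cnt);
        }
    }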
For a 1 MB input file, the execution times in seconds of the programs are:
intro1   6.9
intro2   0.9
intro3   0.4
or about a 17 to 1 difference between the slowest and fastest.
This huge speedup doesn't necessarily prove that you should always emulate the third approach, in which you do your own buffering. Such an approach may be error-prone, especially in handling end-of-file events, if it is not carefully implemented. It may also be less readable than the alternatives. But it's useful to keep in mind where the time goes, and how it can be reclaimed when necessary.
Approach 2 is probably "right" for most applications.
Buffering
Approaches 2 and 3 use the technique of buffering, where large chunks of a file are read from disk, and then accessed a byte or character at a time. Buffering is a basic and important technique for speeding I/O, and several Java classes support buffering (BufferedInputStream for bytes, BufferedReader for characters).
An obvious question is: Will making the buffer bigger make I/O go faster? Java buffers typically are by default 1024 or 2048 bytes long. A buffer larger than this may help speed I/O, but often by only a few percent, say 5 to 10%.
Approach 4: Whole File
The extreme case of buffering would be to determine the length of a file in advance, and then read in the whole file:
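The sketch below continues the newline-counting example; the class and file names are illustrative:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class intro4 {
        public static void main(String[] args) throws IOException {
            File f = new File("test.txt");
            // determine the file length in advance, then read the whole file
            int len = (int) f.length();
            byte[] buf = new byte[len];
            FileInputStream fis = new FileInputStream(f);
            int offset = 0;
            while (offset < len) {
                int n = fis.read(buf, offset, len - offset);
                if (n == -1) {
                    break;
                }
                offset += n;
            }
            fis.close();
            int cnt = 0;
            for (int i = 0; i < offset; i++) {
                if (buf[i] == '\n') {
                    cnt++;
                }
            }
            System.out.println(cnt);
        }
    }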
This approach is convenient, in that a file can be treated as an array of bytes. But there's an obvious problem of possibly not having enough memory to read in a very large file.
Another aspect of buffering concerns text output to a terminal window. By default, System.out (a PrintStream) is line buffered, meaning that the output buffer is flushed when a newline character is encountered. This is important for interactivity, where you'd like to have an input prompt displayed before actually entering any input.
Approach 5: Disabling Line Buffering
But line buffering can be disabled, as in this example:
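The sketch below installs a replacement PrintStream with autoflush turned off; the class name and buffer size are illustrative:

    import java.io.BufferedOutputStream;
    import java.io.FileDescriptor;
    import java.io.FileOutputStream;
    import java.io.PrintStream;

    public class bufout {
        public static void main(String[] args) {
            // wrap standard output in a buffer, with autoflush disabled,
            // so output is no longer flushed at every newline
            FileOutputStream fdout = new FileOutputStream(FileDescriptor.out);
            BufferedOutputStream bos = new BufferedOutputStream(fdout, 1024);
            PrintStream ps = new PrintStream(bos, false);

            System.setOut(ps);

            final int N = 100000;
            for (int i = 1; i <= N; i++) {
                System.out.println(i);
            }
            ps.close();
        }
    }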
This program writes the integers 1..100000 to the output, and runs about three times faster than the default equivalent that has line buffering enabled.
Buffering is also an important part of one of the examples presented below, where a buffer is used to speed up random file access.
Reading/Writing Text Files
It was mentioned earlier that method call overhead can be significant when reading characters from a file. Another example of this can be found in a program that counts the number of lines in a text file:
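A sketch of such a line-counting program, with an illustrative file name:

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class line1 {
        public static void main(String[] args) throws IOException {
            DataInputStream dis =
                    new DataInputStream(new FileInputStream("test.txt"));
            int cnt = 0;
            // readLine() here is the deprecated, byte-oriented method
            while (dis.readLine() != null) {
                cnt++;
            }
            dis.close();
            System.out.println(cnt);
        }
    }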
This program uses the old DataInputStream.readLine method, which is implemented using read method calls to obtain each character. A newer approach is to say:
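A sketch of the BufferedReader version, again with illustrative names:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class line2 {
        public static void main(String[] args) throws IOException {
            BufferedReader br = new BufferedReader(new FileReader("test.txt"));
            int cnt = 0;
            // BufferedReader.readLine is character-based and buffered
            while (br.readLine() != null) {
                cnt++;
            }
            br.close();
            System.out.println(cnt);
        }
    }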
This approach can be faster. For example, on a 6 MB text file with 200,000 lines, the second program is around 20% faster than the first.
But even if the second program isn't faster, there's an important issue to note. The first program triggers a deprecation warning from the Java 2 compiler, because DataInputStream.readLine is obsolete. It does not properly convert bytes to characters, and would not be an appropriate choice for manipulating text files containing anything other than ASCII text byte streams (recall that the Java language uses the Unicode character set, not ASCII).
This is where the distinction between byte streams and character streams noted earlier comes into play. A program such as:
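(a sketch; the output file name and the particular characters are illustrative)

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintStream;

    public class conv1 {
        public static void main(String[] args) throws IOException {
            FileOutputStream fos = new FileOutputStream("out.txt");
            PrintStream ps = new PrintStream(fos);
            // characters outside the byte-oriented encoding are not preserved
            ps.println("\uffff\u4321\u1234");
            ps.close();
        }
    }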
writes an output file, but without preserving the Unicode characters that are actually output. The Reader/Writer I/O classes are character-based, and are designed to resolve this issue. OutputStreamWriter is where the encoding of characters to bytes is applied.
A program that uses PrintWriter to write out Unicode characters looks like this:
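A sketch, with the same illustrative output file and characters:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;

    public class conv2 {
        public static void main(String[] args) throws IOException {
            FileOutputStream fos = new FileOutputStream("out.txt");
            // OutputStreamWriter applies the character-to-byte encoding (UTF8 here)
            PrintWriter pw =
                    new PrintWriter(new OutputStreamWriter(fos, "UTF8"));
            pw.println("\uffff\u4321\u1234");
            pw.close();
        }
    }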
This program uses the UTF8 encoding, which has the property of encoding ASCII text as itself, and other characters as two or three bytes.
Formatting Costs
Actually writing data to a file is only part of the cost of output. Another significant cost is data formatting. Consider a three-part example, one that writes out lines like:
The square of 5 is 25
Approach 1
The first approach is simply to write out a fixed string, to get an idea of the intrinsic I/O cost:
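A sketch of the fixed-string version; the class name format1 matches the timing table below, while the output file name and repetition count are illustrative:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintStream;

    public class format1 {
        public static void main(String[] args) throws IOException {
            PrintStream ps =
                    new PrintStream(new FileOutputStream("out.txt"));
            final int COUNT = 25000;    // illustrative repetition count
            // write a fixed string, to measure the raw output cost
            for (int i = 1; i <= COUNT; i++) {
                ps.println("The square of 5 is 25");
            }
            ps.close();
        }
    }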
Approach 2
The second approach employs simple formatting using "+":
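A sketch, with the class name format2 from the timing table and the same illustrative output file and repetition count:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintStream;

    public class format2 {
        public static void main(String[] args) throws IOException {
            PrintStream ps =
                    new PrintStream(new FileOutputStream("out.txt"));
            final int COUNT = 25000;    // illustrative repetition count
            final int n = 5;
            // build each line with string concatenation
            for (int i = 1; i <= COUNT; i++) {
                ps.println("The square of " + n + " is " + n * n);
            }
            ps.close();
        }
    }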
Approach 3
The third approach uses the MessageFormat class from the java.text package:
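A sketch that precompiles the format once; the class name format3 comes from the timing table, and the output file and repetition count are illustrative:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintStream;
    import java.text.MessageFormat;

    public class format3 {
        public static void main(String[] args) throws IOException {
            PrintStream ps =
                    new PrintStream(new FileOutputStream("out.txt"));
            final int COUNT = 25000;    // illustrative repetition count
            // compile the pattern once and reuse it on every iteration
            MessageFormat fmt =
                    new MessageFormat("The square of {0} is {1}");
            Object[] values = { new Integer(5), new Integer(25) };
            for (int i = 1; i <= COUNT; i++) {
                ps.println(fmt.format(values));
            }
            ps.close();
        }
    }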
These programs produce identical output. The running times are:
format1   1.3
format2   1.8
format3   7.8
or about a 6 to 1 difference between the slowest and fastest. The third program would be even slower if the format had not been precompiled and the static convenience method had been used instead:
Approach 4
MessageFormat.format(String, Object[]), as in:
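A sketch, with the same illustrative names:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintStream;
    import java.text.MessageFormat;

    public class format4 {
        public static void main(String[] args) throws IOException {
            PrintStream ps =
                    new PrintStream(new FileOutputStream("out.txt"));
            final int COUNT = 25000;    // illustrative repetition count
            Object[] values = { new Integer(5), new Integer(25) };
            // the static convenience method reparses the pattern on every call
            for (int i = 1; i <= COUNT; i++) {
                ps.println(MessageFormat.format(
                        "The square of {0} is {1}", values));
            }
            ps.close();
        }
    }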
which takes 1/3 longer than the previous example.
The fact that approach 3 is quite a bit slower than approaches 1 and 2 doesn't mean that you shouldn't use it. But you need to be aware of the cost in time.
Message formats are quite important in internationalization contexts, and an application concerned about this issue might typically read the format from a resource bundle, and then use it.
Random Access
RandomAccessFile is a Java class for doing random access I/O (at the byte level) on files. The class provides a seek method, similar to that found in C/C++, to move the file pointer to an arbitrary location, from which point bytes can then be read or written.
The seek method accesses the underlying runtime system, and as such, tends to be expensive. One cheaper alternative is to set up your own buffering on top of a RandomAccessFile, and implement a read method for bytes directly. The parameter to read is the byte offset (>= 0) of the desired byte. An example of how this is done is:
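A sketch of such a class; the class name ReadRandom, the 4096-byte page size, and the test file name are all illustrative:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // RandomAccessFile with a simple one-page read buffer layered on top
    public class ReadRandom {
        private static final int PAGE_SIZE = 4096;

        private RandomAccessFile file;
        private long filePos = -1;   // file offset of the first buffered byte
        private long fileLen;
        private byte[] buf = new byte[PAGE_SIZE];

        public ReadRandom(String name) throws IOException {
            file = new RandomAccessFile(name, "r");
            fileLen = file.length();
        }

        public long length() {
            return fileLen;
        }

        public void close() throws IOException {
            file.close();
        }

        // read the byte at offset pos (>= 0), seeking and refilling the
        // buffer only when pos falls outside the currently buffered page
        public int read(long pos) throws IOException {
            if (pos < 0 || pos >= fileLen) {
                return -1;
            }
            if (filePos == -1 || pos < filePos || pos >= filePos + PAGE_SIZE) {
                filePos = (pos / PAGE_SIZE) * PAGE_SIZE;
                file.seek(filePos);
                file.readFully(buf, 0,
                        (int) Math.min(PAGE_SIZE, fileLen - filePos));
            }
            return buf[(int) (pos - filePos)] & 0xff;
        }

        // driver: read the bytes in sequence and write them out
        public static void main(String[] args) throws IOException {
            ReadRandom rr = new ReadRandom("test.txt");
            for (long pos = 0; pos < rr.length(); pos++) {
                System.out.write(rr.read(pos));
            }
            System.out.flush();
            rr.close();
        }
    }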
The driver program simply reads the bytes in sequence and writes them out.
This technique is helpful if you have locality of access, where nearby bytes in the file are read at about the same time. For example, if you are implementing a binary search scheme on a sorted file, this approach might be useful. It's of less value if you are truly doing random access at arbitrary points in a large file.
Compression
Java provides classes for compressing and uncompressing byte streams. These are found in the java.util.zip package, and also serve as the basis for Jar files (a Jar file is a Zip file with an added manifest).
The following program takes a single input file, and writes a compressed output Zip file, with a single entry representing the input file:
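A sketch; the class and file names are illustrative:

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class compress {
        public static void main(String[] args) throws IOException {
            String infile = "test.txt";   // illustrative file names
            String outfile = "test.zip";

            BufferedInputStream in =
                    new BufferedInputStream(new FileInputStream(infile));
            ZipOutputStream out = new ZipOutputStream(
                    new BufferedOutputStream(new FileOutputStream(outfile)));

            // a single Zip entry representing the input file
            out.putNextEntry(new ZipEntry(infile));

            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }

            in.close();
            out.closeEntry();
            out.close();
        }
    }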
The next program reverses the process, taking an input Zip file that is assumed to have a single entry in it, and uncompresses that entry to the output file:
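A sketch of the reverse direction, again with illustrative names:

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class uncompress {
        public static void main(String[] args) throws IOException {
            String infile = "test.zip";   // illustrative file names
            String outfile = "test.out";

            ZipInputStream in = new ZipInputStream(
                    new BufferedInputStream(new FileInputStream(infile)));
            BufferedOutputStream out =
                    new BufferedOutputStream(new FileOutputStream(outfile));

            // position at the single entry assumed to be in the Zip file
            ZipEntry entry = in.getNextEntry();
            if (entry == null) {
                throw new IOException("empty Zip file");
            }

            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }

            in.close();
            out.close();
        }
    }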
Whether compression helps or hurts I/O performance depends a lot on your local hardware setup, specifically the relative speeds of the processor and disk drives. Compression using Zip technology typically implies around a 50% reduction in data size, at the cost of some time to compress and decompress. An experiment with large (5 to 10 MB) compressed text files, using a 300-MHz Pentium PC with IDE disk drives, showed an elapsed-time speedup of around 1/3 in reading compressed files from disk, over reading uncompressed ones.
An example of where compression is useful is in writing to very slow media such as floppy disks. A test using a fast processor (300 MHz Pentium) and a slow floppy (the conventional floppy drive found on PCs), showed that compressing a large text file and then writing to the floppy drive results in a speedup of around 50% over simply copying the file directly to the floppy drive.
Caching
A detailed discussion of hardware caching is beyond the scope of this paper. But sometimes software caching can be used to speed up I/O. Consider a case where you want to read lines of a text file in random order. One way to do this is to read in all the lines, and store them in an ArrayList (a collection class similar to Vector):
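A sketch of such a cache; the class name LineCache and the file name are illustrative, while the getLine method matches the discussion that follows:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;

    // cache all lines of a text file in memory for random-order access
    public class LineCache {
        private ArrayList lines = new ArrayList();

        public LineCache(String filename) throws IOException {
            BufferedReader br = new BufferedReader(new FileReader(filename));
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
            br.close();
        }

        // retrieve an arbitrary line by number (0-based)
        public String getLine(int n) {
            if (n < 0 || n >= lines.size()) {
                throw new IllegalArgumentException("line number out of range");
            }
            return (String) lines.get(n);
        }

        public int size() {
            return lines.size();
        }

        public static void main(String[] args) throws IOException {
            LineCache cache = new LineCache("test.txt");
            // as one example of out-of-order access, print the lines in reverse
            for (int i = cache.size() - 1; i >= 0; i--) {
                System.out.println(cache.getLine(i));
            }
        }
    }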
The getLine method is then used to retrieve an arbitrary line. This technique is quite useful, but obviously uses a lot of memory for large files, and so has limitations. An alternative might be to simply remember the last 100 lines that were requested, and read from the disk for any other requests. This scheme works well if there is locality of access of the lines, but not so well if line requests are truly random.
Tokenization
Tokenization refers to the process of breaking byte or character sequences into logical chunks, for example words. Java offers a StreamTokenizer class that operates like this:
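A sketch; class and file names are illustrative:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.StreamTokenizer;

    public class token1 {
        public static void main(String[] args) throws IOException {
            BufferedReader br = new BufferedReader(new FileReader("test.txt"));
            StreamTokenizer st = new StreamTokenizer(br);

            // treat only the lower-case letters a-z as word characters
            st.resetSyntax();
            st.wordChars('a', 'z');

            int cnt = 0;
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    cnt++;
                }
            }
            br.close();
            System.out.println(cnt);
        }
    }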
This example tokenizes in terms of lower-case words (letters a-z). If you implement the equivalent yourself, it might look like:
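A sketch of a hand-written equivalent; the class name, buffer size, and file name are illustrative:

    import java.io.FileInputStream;
    import java.io.IOException;

    public class token2 {
        public static void main(String[] args) throws IOException {
            FileInputStream fis = new FileInputStream("test.txt");
            byte[] buf = new byte[4096];
            int cnt = 0;
            boolean inWord = false;
            int n;
            // scan the raw bytes ourselves, counting runs of a-z as words
            while ((n = fis.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    byte b = buf[i];
                    if (b >= 'a' && b <= 'z') {
                        if (!inWord) {
                            cnt++;
                            inWord = true;
                        }
                    } else {
                        inWord = false;
                    }
                }
            }
            fis.close();
            System.out.println(cnt);
        }
    }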
The second program runs about 20% faster than the first, at the expense of having to write some tricky low-level code.
StreamTokenizer is sort of a hybrid class, in that it will read from character-based streams (like BufferedReader), but at the same time operates in terms of bytes, treating all characters with two-byte values (greater than 0xff) as though they are alphabetic characters.
Serialization
Serialization is used to convert arbitrary Java data structures into byte streams, using a standardized format. For example, the following program writes out an array of random integers:
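A sketch; the class name, output file name, and array size are illustrative:

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.util.Random;

    public class serial1 {
        public static void main(String[] args) throws IOException {
            // build an array of random integers (the size is illustrative)
            Random rand = new Random();
            int[] data = new int[100000];
            for (int i = 0; i < data.length; i++) {
                data[i] = rand.nextInt();
            }

            // buffer the underlying file stream to speed the write
            ObjectOutputStream oos = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream("test.ser")));
            oos.writeObject(data);
            oos.close();
        }
    }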
and this program reads the array back in:
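A sketch of the corresponding reader, with the same illustrative names:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;

    public class serial2 {
        public static void main(String[] args)
                throws IOException, ClassNotFoundException {
            // buffer the underlying file stream to speed the read
            ObjectInputStream ois = new ObjectInputStream(
                    new BufferedInputStream(new FileInputStream("test.ser")));
            int[] data = (int[]) ois.readObject();
            ois.close();
            System.out.println(data.length + " integers read back");
        }
    }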
Note that we used buffering to speed the I/O operations.
Is there a faster way than serialization to write out large volumes of data, and then read it back? Probably not, except in special cases. For example, suppose that you decide to write out a 64-bit long integer as text instead of as a set of 8 bytes. The maximum length of a long integer as text is around 20 characters, or 2.5 times as long as the binary representation. So it seems likely that this format wouldn't be any faster. In some cases, however, such as bitmaps, a special format might be an improvement. However, using your own scheme does work against the standard offered by serialization, so doing so involves some tradeoffs.
Beyond the actual I/O and formatting costs of serialization (using DataInputStream and DataOutputStream), there are other costs, for example, the need to create new objects when deserializing.
Note also that the methods of DataOutputStream can be used to develop semi-custom data formats, for example:
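A sketch; class and file names are illustrative:

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class writeints {
        public static void main(String[] args) throws IOException {
            DataOutputStream dos = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream("test.dat")));
            // write 10 integers in a simple fixed binary format
            for (int i = 0; i < 10; i++) {
                dos.writeInt(i);
            }
            dos.close();
        }
    }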
and:
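A sketch of the read side, with the same illustrative names:

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class readints {
        public static void main(String[] args) throws IOException {
            DataInputStream dis = new DataInputStream(
                    new BufferedInputStream(new FileInputStream("test.dat")));
            // read the 10 integers back in the same order
            for (int i = 0; i < 10; i++) {
                System.out.println(dis.readInt());
            }
            dis.close();
        }
    }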
These programs write 10 integers to a file and then read them back.
Obtaining Information About Files
Our discussion so far has centered on input and output for individual files. But there's another aspect of speeding up I/O performance, one that relates to finding out properties of files. For example, consider a small program that prints the length of a file:
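A sketch, with an illustrative file name:

    import java.io.File;

    public class length1 {
        public static void main(String[] args) {
            File f = new File("test.txt");
            // the length is obtained by querying the operating system
            System.out.println("length = " + f.length());
        }
    }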
The Java runtime system itself cannot know the length of a file, and so must query the underlying operating system to obtain this information. This holds true for other file information, such as whether a file is a directory, the time it was last modified, and so on. The File class in the java.io package provides a set of methods to query this information. Such querying is in general expensive in terms of time, and should be used as little as possible.
A longer example of querying file information, one that recursively walks the file system roots to dump out a set of all the file pathnames on a system, looks like this:
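A sketch of such a program; the class name DumpPaths and the details of the traversal are illustrative:

    import java.io.File;

    // recursively walk the file system roots, printing every pathname found
    public class DumpPaths {

        private static void dump(File dir) {
            File[] entries = dir.listFiles();
            if (entries == null) {
                return;   // unreadable directory
            }
            for (int i = 0; i < entries.length; i++) {
                File entry = entries[i];
                if (!entry.exists()) {
                    continue;
                }
                System.out.println(entry.getPath());
                // each entry is queried exactly once as to its type
                if (entry.isDirectory()) {
                    dump(entry);
                }
            }
        }

        public static void main(String[] args) {
            File[] roots = File.listRoots();
            for (int i = 0; i < roots.length; i++) {
                System.out.println(roots[i].getPath());
                dump(roots[i]);
            }
        }
    }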
This example uses File methods, such as isDirectory and exists, to navigate through the directory structure. Each file is queried exactly once as to its type (plain file or directory).