
OPTIMIZING THE JAVA VIRTUAL MACHINE INSTRUCTION SET BY ...



Popping any values left on the stack after the bytecode sequence finished executing was a considerably easier undertaking than pushing the values needed before the sequence could execute. The pop bytecode removes one category 1 value (an integer, float or object reference) from the stack and discards it. Similarly, the pop2 bytecode removes and discards one category 2 value (a long integer or double). Consequently, popping the remaining values was simply a matter of generating a sequence of pop and pop2 bytecodes based on the types of the values present on the stack when the sequence finished executing.
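As a minimal sketch, assuming the types of the values left on the operand stack are already known, the pop-generation step could be written as follows. The class, enum and method names are hypothetical; only the pop (0x57) and pop2 (0x58) opcodes come from the JVM specification. Values are discarded from the top of the stack downwards, with pop2 used only for category 2 values, as described above.

```java
import java.util.List;

public class PopSequenceGenerator {

    // Hypothetical tags for the types left on the operand stack.
    enum StackType {
        INT, FLOAT, REFERENCE, // category 1: occupies one stack slot
        LONG, DOUBLE;          // category 2: occupies two stack slots

        boolean isCategory2() {
            return this == LONG || this == DOUBLE;
        }
    }

    static final byte POP = 0x57;  // pop: discard one category 1 value
    static final byte POP2 = 0x58; // pop2: discard one category 2 value

    /**
     * Emits one pop or pop2 per value left on the stack, discarding the
     * topmost value first. The input list is ordered bottom to top.
     */
    static byte[] generatePops(List<StackType> stackBottomToTop) {
        byte[] code = new byte[stackBottomToTop.size()];
        int pc = 0;
        for (int i = stackBottomToTop.size() - 1; i >= 0; i--) {
            code[pc++] = stackBottomToTop.get(i).isCategory2() ? POP2 : POP;
        }
        return code;
    }

    public static void main(String[] args) {
        // Stack after the sequence: int (bottom), double, reference (top).
        List<StackType> stack = List.of(StackType.INT, StackType.DOUBLE, StackType.REFERENCE);
        StringBuilder out = new StringBuilder();
        for (byte op : generatePops(stack)) {
            out.append(op == POP2 ? "pop2" : "pop").append(' ');
        }
        System.out.println(out.toString().trim()); // prints "pop pop2 pop"
    }
}
```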

Because of the time required for profiling, it was not possible to time every candidate sequence. Instead, the transfer reduction scoring system was used to identify candidate sequences that were likely to offer improved performance. The best sequences according to transfer reduction were then timed, and the sequence that offered the greatest performance improvement was selected. Unfortunately, this technique did not offer any performance gain over simple transfer reduction. This result occurred because the micro-benchmarking technique employed was not an accurate representation of the environment in which the bytecode sequence normally executes. Furthermore, the micro-benchmarking approach completely failed to consider the cumulative impact of performing several multicode substitutions. These impacts included the added cost of decoding because the switch statement contains additional cases, the impact on cache performance due to decreased code locality, and the impact on register allocation when the virtual machine is compiled, due to the increased amount of code residing within the main interpretation engine.

A different timing strategy was developed to overcome these limitations. Instead of micro-benchmarking, testing was performed using the full benchmark. This removed the need to generate a new Java class file, since the benchmark's application classes were used. It also avoided the problems associated with timing in a contrived environment. Furthermore, instead of timing each bytecode sequence individually, once the best multicode was identified it was included in all subsequent timings. Thus the cumulative impact of performing several multicode substitutions was also taken into account. As with micro-benchmarking, it was too expensive to time every possible bytecode sequence, so transfer reduction scoring was again used as a predictor of the best bytecode sequences. The five best sequences identified by transfer reduction scoring were timed, the substitution that resulted in the best performance was selected, and the process was repeated until the desired number of multicodes had been identified.
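The greedy selection procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this work: the `Candidate` record, the precomputed transfer reduction scores (treated as fixed across rounds) and the `runBenchmark` hook, which would rebuild the virtual machine with the given multicodes and time the full benchmark, are all hypothetical. The demo replaces real timings with an invented cost model.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

public class GreedyMulticodeSelection {

    // Hypothetical candidate: a bytecode sequence and its transfer reduction score.
    record Candidate(String sequence, long transferReductionScore) {}

    static final int TIMED_PER_ROUND = 5; // time the five best-scoring candidates

    /**
     * Each round: rank the remaining candidates by transfer reduction score
     * (treated as fixed here, a simplifying assumption), time the top few with
     * the full benchmark while keeping all earlier picks, and keep the fastest.
     */
    static List<Candidate> select(List<Candidate> pool, int desired,
                                  ToDoubleFunction<List<Candidate>> runBenchmark) {
        List<Candidate> remaining = new ArrayList<>(pool);
        List<Candidate> selected = new ArrayList<>();
        while (selected.size() < desired && !remaining.isEmpty()) {
            remaining.sort(Comparator.comparingLong(Candidate::transferReductionScore).reversed());
            List<Candidate> shortlist =
                    remaining.subList(0, Math.min(TIMED_PER_ROUND, remaining.size()));

            Candidate best = null;
            double bestTime = Double.POSITIVE_INFINITY;
            for (Candidate c : shortlist) {
                List<Candidate> trial = new ArrayList<>(selected); // earlier picks stay in
                trial.add(c);
                double time = runBenchmark.applyAsDouble(trial);
                if (best == null || time < bestTime) {
                    best = c;
                    bestTime = time;
                }
            }
            selected.add(best);
            remaining.remove(best);
        }
        return selected;
    }

    // Invented data and cost model standing in for real profiling and timing.
    static List<Candidate> demo() {
        List<Candidate> pool = List.of(
                new Candidate("aload_0 getfield", 900),
                new Candidate("iload iload iadd", 800),
                new Candidate("aload_0 aload_1", 700),
                new Candidate("iconst_1 iadd", 600),
                new Candidate("dup astore", 500),
                new Candidate("iload_1 iload_2", 100));
        Map<String, Double> gain = Map.of(
                "aload_0 getfield", 3.0, "iload iload iadd", 4.0,
                "aload_0 aload_1", 1.0, "iconst_1 iadd", 2.0,
                "dup astore", 0.5, "iload_1 iload_2", 10.0);
        ToDoubleFunction<List<Candidate>> fakeBenchmark = multicodes -> {
            double seconds = 100.0;
            for (Candidate c : multicodes) {
                seconds -= gain.get(c.sequence()); // hypothetical speedup
                seconds += 0.5;                    // hypothetical cost per extra case
            }
            return seconds;
        };
        return select(pool, 3, fakeBenchmark);
    }

    public static void main(String[] args) {
        for (Candidate c : demo()) {
            System.out.println(c.sequence());
        }
    }
}
```

Because each trial list contains every multicode chosen in earlier rounds, the measured time reflects the cumulative cost of the extra switch cases rather than each sequence in isolation. In the invented data, the highest-scoring sequence is not the fastest in the first round, which is the situation that motivates timing several candidates rather than trusting the score alone.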

Using this technique has been shown to provide better performance than using
