Introduction
 Low level stuff in Java
 Machine Level Java
 Java Native Interface
 Assembly
 Simple example
 Steps to get used to low level
 Where it should be used
 The performance comparison
 Download



The Machine Level Java

Never has the machine level been so close to the Java developer

Easy to use assembly environment
with intellisense and other IDE stuff, 100% written in Java


Introduction

For Java developers whose struggle for performance is not forgotten history, it is time to pay attention to a new performance-related technology. It takes a big step towards speeding up critical algorithms written in Java. It connects machine-level optimization with the Java environment in a way that is easy to follow, and it makes things faster without any tool other than your preferred Java IDE. Once mastered, it can speed up some algorithms by a factor of five or even more.

Now let us look at how it works. First of all, it is worth mentioning that Java programs are executed with the help of the so-called JIT (Just-In-Time) compiler. The JIT does its best to translate Java code into machine code. But nothing is ideal, and the JIT does not always perform in the best possible way. So for a performance-critical application we need something better. But what options do we have?

Low level stuff in Java

The main problem with any binary data is its manageability. Humanity invented high-level languages to help ordinary people cope with the complexity of binary data. In Java, all binaries are hidden inside the virtual machine, and no low-level piece is exposed to the programmer. As a negative consequence, we do not always get the best possible performance. To regain it we need to go back to the experience of the past. But going back does not mean we should throw away everything we have achieved so far.

First of all, we should not throw away our knowledge and habits. That is why low level in Java does not require you to use anything from outside the Java world. One part is relatively new for Java, but it is only as new as any other new Java technology, nothing more and nothing special. When we speak about technologies from the J2SE or J2EE area, we can remember how we studied some parts of plain or enterprise Java. Exactly the same kind of study awaits us if we want to learn Machine Level Java. It is the same combination of APIs and new domain knowledge that we expect to meet in any other area where Java programming is applicable.

Machine Level Java

The first part of Machine Level Java is the mechanism we use to call low-level programs from Java. The Java Language Specification defines native methods, and they are our way into the low-level world. To open the gate, mark a method declaration (a method without a body) as native, in the same manner as with any other method modifier.

Next, as is also the case with interface methods, we need to provide a method implementation. With the help of MLJ (Machine Level Java) the implementation looks just like an ordinary Java method. But in fact it produces binary data that the hardware accepts as a low-level program. The simplest way to use MLJ features is to extend one of the classes that implement the root MLJ interface InlineAssembly, which defines the method:
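As a minimal sketch of the declaration step, the plain Java class below declares a native method and uses reflection to confirm that the native modifier is present. The class name NativeDeclarationDemo and the helper isNative are hypothetical and are not part of MLJ; no native library is loaded and the native method is never called.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical demo class, not part of MLJ.
final class NativeDeclarationDemo {

    // A native method has no body; its implementation is supplied at run time.
    static native long multiply(int x, int y);

    // Reports whether the named (int, int) method of this class is declared native.
    static boolean isNative(String methodName) {
        try {
            Method m = NativeDeclarationDemo.class
                    .getDeclaredMethod(methodName, int.class, int.class);
            return Modifier.isNative(m.getModifiers());
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("multiply is native: " + isNative("multiply"));
    }
}
```

Declaring the method is enough for the class to compile and load; an implementation is needed only at the moment the method is actually invoked.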

public void writeNativeCode();

It is better for implementation quality and overall simplicity if one successor of InlineAssembly implements one native method. As a result, we get a very simple interface with a single method to implement. The direct successors of the InlineAssembly interface do not implement this method, because they are not meant to serve a particular low-level need; they only provide architecture-specific helpers. So a programmer looking for better performance should first declare a native method in a class that extends one of the direct successors of InlineAssembly. Next, the programmer should implement the writeNativeCode method, where all the low-level actions take place. In standard Java practice the programmer works directly with parameters and a return value, but at the low level no parameters are visible and we have to fetch them ourselves. Here MLJ makes life easier by providing helper methods that make the parameters easily accessible. MLJ also helps with calling the JVM's own functions from the low level.

Another convenience of MLJ is its native library framework. It hides the complexity of supplying the JVM with native binary code. Behind the scenes MLJ creates the required native libraries and loads them into the JVM. This part of the process is invisible to the Java developer, who runs a Java program in the standard way and does not need to think about any native library. Every time a native method implementation changes, MLJ recreates and reloads the native library transparently. If nothing has changed, the native library stays untouched. For all this convenience to work properly, the class that contains the native method needs a static initializer. In the initializer the Java developer invokes one of InlineAssemblyHelper's static methods whose name starts with initializeNativeCode. It is also possible to initialize the native library somewhere else, but then the developer must remember that the initialization has to precede the first invocation of the native method.

But the most important part of MLJ is the Assembler project, which is taken from the Java Embryo System (aka jEmbryoS). It defines an API for interacting with the main piece of hardware, the processor. But before we look at the processor, we should take a look at the Java Native Interface.

Java Native Interface

Java Native Interface (JNI) defines the interaction between low-level programs and the JVM. It forwards Java parameters to a native method, takes the return value from the low level and forwards it to the caller of the native method, and also allows calls from the low level into the JVM. The native method parameters are passed to the low level according to the so-called calling convention used by the target platform (operating system and processor architecture). On 32-bit x86, the widely used calling conventions pass parameters through the computer's memory. The caller places the native method's arguments in memory and makes sure the hardware holds the address of that group of arguments. So to get a parameter we first need to take the address from the hardware and then read the value from memory at that address. MLJ hides this process from the programmer by providing a method that finds the actual value for a given parameter index. The method is called parameterIn; it takes a register, where the parameter's value will be stored, and a parameter index.

JNI passes more parameters to the low level than the native method declares. The first two parameters are always added by JNI: a reference to the JNI internals and a reference to the Java object that owns the native method. The first reference can be used to call the JVM's functions. The second reference is this for non-static native methods, or the class object for static native methods. After these two come all the parameters defined in the native method declaration. If a parameter has a primitive type, its value is copied when it is passed to the low level. If the parameter's type is not primitive, the low level sees a reference to it. Reference-type parameters can be used to read or change a field of a Java object, to read from a Java array, or to fill an array with values. Copied (primitive) values can be used in calculations and as the value written to a Java object's field or a Java array.

JNI defines a group of usage rules for reference-type parameters. First, fields of a referenced object and items of a referenced array are always accessed through the first parameter, the reference to the JVM's internals. So to get a value from an array we first need to call the JVM and ask it for the array's data. For primitive arrays the data is returned as a sequence of values in memory. For object arrays and other reference types the data is returned as one reference per call.

The next important thing about JNI references is the need to release every reference acquired from the JVM. The parameter references themselves should not be released, but any use of such a parameter leads to a call to the JVM, which provides some value. If that value is of a reference type, it should be released before the low level returns. This matters because the JVM manages memory with the help of a garbage collector, and the garbage collector has to know whether an object can be discarded. If the low level gets a reference and does not release it, the garbage collector is unable to free the memory occupied by the referenced object. That is why JNI provides a group of functions whose names start with the word "Release".

But before we release something we have to get a reference to it. The reference can be acquired from the JVM using JNI functions. MLJ provides a set of overloaded methods named callFunction. Basically, callFunction takes a function name (in the form of an enum constant) and a list of parameters. Parameters are provided as instances of the CallParameter class. To make a program shorter and clearer, it is recommended to use the helper functions that return a CallParameter. There are three such functions: p, v and r. The first returns a native method parameter, which is selected by a special enum constant. The second returns a value call parameter, which represents an ordinary integer. The third returns a register-based call parameter. So if we need to get a string's length, the following code is what we need:
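The argument order described above can be sketched in plain Java. The hypothetical class JniArgumentLayout below (not part of MLJ or JNI) uses reflection to list what the low level receives for a given native method: the JNIEnv reference, then the class or this reference, then the declared parameters, with primitives marked as copies and everything else as references. The slot descriptions are informal labels chosen for this illustration.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of MLJ or JNI.
final class JniArgumentLayout {

    static native long multiply(int x, int y); // static: second slot is the class object
    native int length(String s);               // instance: second slot is "this"

    // Lists, in order, the arguments the low level receives for a native method.
    static List<String> describe(Method m) {
        List<String> slots = new ArrayList<>();
        slots.add("JNIEnv*");
        slots.add(Modifier.isStatic(m.getModifiers()) ? "jclass" : "jobject (this)");
        for (Class<?> p : m.getParameterTypes()) {
            slots.add(p.isPrimitive()
                    ? "copy of " + p.getName()
                    : "reference to " + p.getSimpleName());
        }
        return slots;
    }

    // Convenience overload that looks the method up by name in this class.
    static List<String> describe(String name, Class<?>... parameterTypes) {
        try {
            return describe(JniArgumentLayout.class.getDeclaredMethod(name, parameterTypes));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + name, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("multiply", int.class, int.class));
        System.out.println(describe("length", String.class));
    }
}
```

For the static multiply method the list starts with the JNIEnv reference and the class object, followed by two copied int values; for the instance method length the second slot is the this reference and the third is a reference to the String.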

callFunction(r.EAX,JniFunctions.GetStringLength,p(IP.JNIEnv),p(IP.In0));

Here the first parameter is the register used for temporary storage of values, the second is the enum constant that names the function, the third is the reference to the JVM's internals (called JNIEnv in the JNI documentation), and the fourth is a reference to the first native method parameter, denoted by the enum constant In0. In0 means the input parameter with index 0 (the JNI and this/class references have their own names in the IP enum).

And finally we should return some value from our native method. It is returned in the processor's registers. Registers are internal hardware storage cells for numbers. They can hold a value of a primitive Java type or an address that points to a Java object (an address can also be seen as a reference). On 32-bit x86 processors a long is returned in two registers: EDX holds the high half and EAX the low half. Floating-point values (float and double) are returned in the x87 register ST(0) under the standard 32-bit calling conventions. All other types are returned in the single register EAX. But if we speak about registers, then it is time to introduce the assembly language.
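To make the two-register return of a long concrete, here is a small plain Java sketch that splits a 64-bit value into the two 32-bit halves that would travel in EDX (high) and EAX (low), and joins them again. The class and method names are hypothetical and only model the arithmetic; no registers are touched.

```java
// Hypothetical model of a 64-bit value carried in the EDX:EAX register pair.
final class RegisterPair {

    // Low 32 bits, the part that would sit in EAX.
    static int low(long value) {
        return (int) value;
    }

    // High 32 bits, the part that would sit in EDX.
    static int high(long value) {
        return (int) (value >>> 32);
    }

    // Rebuilds the 64-bit value; the low half is masked to avoid sign extension.
    static long combine(int edx, int eax) {
        return ((long) edx << 32) | (eax & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long value = 10_000_000_000L;
        int edx = high(value);
        int eax = low(value);
        System.out.printf("EDX=%08X EAX=%08X combined=%d%n", edx, eax, combine(edx, eax));
    }
}
```

The mask in combine matters: without it a low half with its top bit set would be sign-extended and would overwrite the high half.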

Assembly

Assembly language is a way to mark every binary machine instruction with a human-readable mnemonic. Every instruction has its own meaning, just as every Java method has. So for a Java programmer it is convenient to think of an instruction as a method that the processor will execute. Of course, it is important to understand what a processor is, but here again we can recall an entity such as an application server or web server and notice the similarity. Both a processor and a server execute commands. And for them to run our instructions, an interface is required that presents those instructions in a form the server or the processor understands. For Intel's x86 processors this interface is defined in the Assembler project, which is a part of the Java Embryo System operating system (aka jEmbryoS). The project defines mnemonics for many processor instructions. From the Java developer's point of view every processor instruction in native code looks like a Java method call. As a result of such a call, an x86-compatible processor will perform the action that corresponds to the call (or instruction, as the call is named in the low-level community).

Next comes a brief introduction to the processor's capabilities. A processor has internal storage cells called registers. All a processor can do is manipulate the values in its registers and read or write numbers from or to memory. Processor instructions represent the manipulations a processor can perform.

Now we can introduce processor instructions in more detail. By historical tradition in the low-level community, an instruction is written as a short name followed by a few arguments, almost exactly like a method call (but with shorter names and without parentheses). Instructions are named after their actions. For example, mov orders the processor to move a value from one place (a register or memory) to another (again a register or memory), and add orders it to add two numbers and place the result in a register or at some memory location.
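The "instruction as a method call" view can be illustrated with a toy model in plain Java. The class ToyCpu below is hypothetical and is not the Assembler project's API: it interprets mov and add directly on three simulated registers instead of emitting machine code, and it follows the destination-first operand order of Intel syntax.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy register machine for illustration only; it does not generate native code.
final class ToyCpu {

    enum Reg { EAX, EBX, EDX }

    private final Map<Reg, Integer> regs = new EnumMap<>(Reg.class);

    ToyCpu() {
        for (Reg r : Reg.values()) {
            regs.put(r, 0); // all registers start at zero
        }
    }

    // mov dst, imm: load a constant into a register.
    void mov(Reg dst, int value) {
        regs.put(dst, value);
    }

    // mov dst, src: copy one register into another.
    void mov(Reg dst, Reg src) {
        regs.put(dst, regs.get(src));
    }

    // add dst, src: dst = dst + src, wrapping at 32 bits like the hardware.
    void add(Reg dst, Reg src) {
        regs.put(dst, regs.get(dst) + regs.get(src));
    }

    int get(Reg r) {
        return regs.get(r);
    }

    // Runs "mov eax, 5; mov ebx, 7; add eax, ebx" and returns EAX.
    static int demo() {
        ToyCpu cpu = new ToyCpu();
        cpu.mov(Reg.EAX, 5);
        cpu.mov(Reg.EBX, 7);
        cpu.add(Reg.EAX, Reg.EBX);
        return cpu.get(Reg.EAX);
    }

    public static void main(String[] args) {
        System.out.println("EAX after add: " + demo());
    }
}
```

The difference from a real assembler API is that a real one would append the binary encoding of each instruction to a buffer, while this model simply executes the operation.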

Simple example

After this short introduction to the world of processors and their instructions, we are ready to see a full example of a native method implementation. It looks like this:

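import java.io.OutputStream;
// MLJ imports (X86InlineAssembly, InlineAssemblyHelper, Architectures) are omitted here;
// their packages depend on the MLJ distribution.

public class SimpleNativeDemo extends X86InlineAssembly // X86InlineAssembly is a successor of InlineAssembly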
{
	static // static initializer
	{
		InlineAssemblyHelper.initializeNativeCode_deleteExistingLibraryAndNoExceptions(new SimpleNativeDemo(System.out));
	}

	// constructor, which defines x86 architecture as a native method's target
	public SimpleNativeDemo(OutputStream debugStream)
	{ super(Architectures.X86.architecture, false, debugStream); }

	// native method declaration
	public static native long multiply(int x, int y);

	// native method implementation
	@Override
	public void writeNativeCode()
	{
		parameterIn(r.EAX,IP.In0.ordinal());
		parameterIn(r.EBX,IP.In1.ordinal());
		mul.x32(r.EBX);
	}
}

Here we see a class that extends the X86InlineAssembly parent class. The parent defines many useful low-level methods and lets its successors use a simple assembly syntax close to the syntax that is common in the low-level community.

After the class declaration comes the static initializer, where InlineAssemblyHelper initializes our simple native library. The initializer is told to overwrite the existing native library if one exists and its code differs from the code of the native method implementation. The initializer takes one parameter, an instance of our demo class. This instance is asked to provide the native code by means of the writeNativeCode method.

The demo class's constructor takes an output stream to which a textual representation of the native code will be written. The constructor also sets x86 as the instance's target architecture.

Next comes the native method declaration. It takes two parameters of type int and returns a value of type long. It is expected to multiply parameter x by y and return the product.

After that follows the native method implementation. It reads the two input parameters into two registers, EAX and EBX. It uses constants of the IP (input parameters) enum to ensure that the parameter indexes are correct. The first argument of parameterIn is a register, which can be selected by typing the letter r and pressing the dot key (.). Once both parameters are in the processor's registers, we can instruct the processor to multiply them. To do that we use the mul instruction (short for multiply). The method x32 asks the instruction to emit its binary code. The 32 here means that the multiplication is performed on 32-bit integers (the primitive type int in Java). Multiplication of 64-bit integers (long) is also possible, but for the mul instruction to do it the processor has to be in 64-bit mode, and here we are using instructions for 32-bit mode. The mul instruction takes the EBX register as its parameter. This means that the multiplication is performed on the value from the explicitly named register (EBX) and the value from the implicitly assumed register EAX. That is why we put the first parameter into the EAX register. The processor leaves the 64-bit product in the EDX:EAX register pair, which is exactly where a long return value is expected.

Once our demo class is defined, its native method can be called from any Java program. Just write SimpleNativeDemo.multiply(5,7); and you will always get the result 35.

If for some reason it is not acceptable to have only one native method per Java class, you can write as many native methods in one class as you wish. But this is a bit more tedious, because there are no helper classes that work with function calls and parameters. An example of this approach is given in TestSet's inner class called Test. It uses the NativeMethodImplementation annotation to provide the initialization library with the required information.
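For checking results, it can help to have a pure Java reference for what the native method computes. The sketch below assumes that mul.x32 emits the standard one-operand x86 MUL, which treats both 32-bit operands as unsigned and produces a 64-bit product; the class name MulReference and its methods are hypothetical. A signed variant, corresponding to the x86 imul instruction, is included for comparison.

```java
// Hypothetical pure Java reference for 32x32 -> 64-bit multiplication.
final class MulReference {

    // Models x86 MUL: both operands are treated as unsigned 32-bit values.
    static long mulUnsigned(int x, int y) {
        return (x & 0xFFFFFFFFL) * (y & 0xFFFFFFFFL);
    }

    // Models a signed 32x32 -> 64-bit multiply (x86 IMUL, one-operand form).
    static long mulSigned(int x, int y) {
        return (long) x * y;
    }

    public static void main(String[] args) {
        System.out.println(mulUnsigned(5, 7));  // same as the signed result for non-negative inputs
        System.out.println(mulUnsigned(-1, 2)); // -1 is read as 4294967295
        System.out.println(mulSigned(-1, 2));
    }
}
```

For non-negative arguments such as 5 and 7 both variants give 35. They differ as soon as an argument is negative, so a native multiply that should follow Java's signed int semantics would need the signed instruction instead.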

Steps to get used to low level

The Machine Level Java can be downloaded here.

But first a Java developer needs to become acquainted with x86 processor instructions. This knowledge is not very hard to find on the net, and if performance is important, acquiring it should not be a limiting factor. While studying the processor's instructions, a developer also gets acquainted with assembly syntax. This syntax is used extensively by the MLJ technology and by its base, the Assembler project.

Next, a developer can take the TestSet class and its main method as a starting point. It contains a number of examples that demonstrate the use of MLJ technology. For performance-hungry developers the examples include a matrix multiplication test, which shows a fivefold speed increase on matrices of size 3000 when the Java and SSE implementations are compared on a Core 2 Duo P7350 CPU at 2 GHz under 32-bit Windows.

It is also highly recommended to read about the Java Native Interface (JNI). At the very least, the function list in the JNI documentation, which describes all JNI functions, is a must-have.

Once there are no gaps in knowledge, it is possible to start trying the technology. But it is very important to understand that most errors made while implementing native methods will, in the best case, crash the JVM. In the worst case they introduce unpredictable behavior into a production Java environment, where instead of crashing, applications on an application server start to behave erratically.

The native library issue is also worth mentioning. First, it is an open question for the developer how to change the security manager settings if a security manager is enabled. Next, there is a system administration issue if the JVM runs under an account with limited privileges and has no directory on the system path that it can write to. The JVM loads native libraries from the system path. It is recommended to create a separate directory where the JVM account has write privileges and to add it to the system path list.

Where it should be used

The biggest performance gain comes from choosing the most suitable algorithm. But even if the algorithm is as good as it can be, MLJ technology is not always the right answer. Computer system performance is a complex area, where it is not enough just to move part of an algorithm from one file to another (from standard Java to an MLJ native method implementation). It is always necessary to weigh the pros and cons of MLJ technology.

Another possible replacement for MLJ is C-based native libraries. But the only advantage such libraries can offer over the tight duo of Java and MLJ is code that is already written. In that case it is really easy to use a C-based library instead of a newly developed equivalent. But if the choice is between a new C program and a new Java+MLJ program, then the performance of the latter will almost always be no worse than that of the former. The proof is simple. First, the JIT often performs really well. Second, where it does not perform very well, it is an open question whether a C compiler can do much better. Third, once a bottleneck in JIT performance is identified, it is just a matter of time to replace the poorly optimized part of the code with its MLJ equivalent. And fourth, sometimes a C compiler is able to optimize code better than the JIT and with less effort than MLJ takes, but such cases are so rare that it is simply too expensive to keep a C programmer on a Java team instead of allocating a small time reserve to a purely Java team.

MLJ can benefit the performance if:

But MLJ cannot benefit the performance if:

The next issue is platform independence. Java's motto "write once, run everywhere" is not very far from MLJ's potential. For Java's motto to become reality, there has to be a JVM for every platform we are interested in. But the effort needed to create a JVM is so huge compared with the effort required to support native library creation with MLJ on any platform that the MLJ-related effort is hardly noticeable against the gigantic JVM-related work. So it seems entirely possible to include MLJ in a standard JVM as a part of the standard Java API, with native support at the same level as, for example, files or GUI. That is why MLJ was architected as an extensible framework with platform-independent base classes.

But until MLJ becomes a part of standard Java, it is still possible to keep Java's promise of cross-platform capability. The class InlineAssemblyHelper provides the static helper method generateFallbackWrapper. This method generates a wrapper around the provided InlineAssembly successor. The wrapper exposes an overridden variant of the native method and decides what to call: the native method or its "no JNI" implementation, which is provided in the generated class. Of course, it is the Java developer's responsibility to supply a correct "no JNI" implementation; the generated class only declares an empty implementation stub. Still, filling in a predefined stub is better than tediously writing a complete wrapper.

Finally, it should be admitted that for now only the 32-bit x86 processor architecture is supported reliably. 64-bit support is implemented but requires extensive testing. No processor architecture other than Intel's x86 is implemented. Also, the native library framework currently supports only 32-bit Windows.
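The general idea of such a fallback can be sketched in plain Java. This is not the code that generateFallbackWrapper produces, whose exact shape is not shown in this article; it is a hand-written illustration with hypothetical names. It relies on the standard JVM behavior that calling a native method with no loaded implementation throws UnsatisfiedLinkError.

```java
// Hand-written illustration of a native call with a pure Java fallback.
final class MultiplyWithFallback {

    // Set to false after the first failed native call, so later calls skip the attempt.
    private static volatile boolean nativeAvailable = true;

    // Native variant; no library provides it in this sketch.
    private static native long multiplyNative(int x, int y);

    // The "no JNI" implementation, written by the developer.
    static long multiplyJava(int x, int y) {
        return (long) x * y;
    }

    // Public entry point: tries the native code first, then falls back to Java.
    static long multiply(int x, int y) {
        if (nativeAvailable) {
            try {
                return multiplyNative(x, y);
            } catch (UnsatisfiedLinkError e) {
                nativeAvailable = false; // the native library is missing on this platform
            }
        }
        return multiplyJava(x, y);
    }

    public static void main(String[] args) {
        System.out.println(multiply(5, 7));
    }
}
```

Run as is, with no native library present, the call falls through to the Java implementation and prints 35. Remembering the failure in a flag avoids paying for an exception on every call.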

The performance comparison

For the comparison to be sound, we need to compare two implementations of the same algorithm and run them on the same hardware. Matrix multiplication was chosen as the algorithm, and a Core 2 Duo P7350 CPU at 2 GHz under 32-bit Windows as the hardware platform. There are many matrix multiplication algorithms, but for our purposes it does not matter how fast a particular algorithm is. What matters is the relative performance of two different implementations of one algorithm. That is why the simplest matrix multiplication algorithm was chosen. It consists of three nested loops, with the innermost one performing the actual multiplication and sum accumulation. So there is an obvious part to optimize: the innermost loop. One version of the loop was implemented in Java and another in MLJ. In addition, two MLJ approaches were tested: one that uses general-purpose registers only and another that uses the SSE2 extension.

One important thing here is a constraint of the Java Language Specification that makes us perform all operations on 64-bit long numbers even though the input values are 32 bits long. If we write a 32-bit multiplication in Java, its result can exceed 32 bits. In that case the Java Language Specification tells us to ignore the overflow and keep only the lowest 32 bits, which are then interpreted as a signed integer, distorting the result even more than the overflow alone. That is why another test was performed in which the exactness of the result was not taken into account.
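The three nested loops can be sketched in Java as follows. This is only an illustration of the algorithm, not the benchmark code from TestSet: the class name, the int[][] input layout and the long[][] result are assumptions made for the example. The innermost loop, which widens each product to 64 bits and accumulates the sum, is the part that the native versions replace.

```java
// Illustrative naive matrix multiplication; data layout is an assumption.
final class NaiveMatrixMultiply {

    // Multiplies an n x k matrix by a k x m matrix, accumulating in 64 bits.
    static long[][] multiply(int[][] a, int[][] b) {
        int n = a.length;
        int k = b.length;
        int m = b[0].length;
        long[][] c = new long[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                long sum = 0;
                // Innermost loop: the hot spot that is worth optimizing.
                for (int t = 0; t < k; t++) {
                    sum += (long) a[i][t] * b[t][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] a = {{1, 2}, {3, 4}};
        int[][] b = {{5, 6}, {7, 8}};
        System.out.println(java.util.Arrays.deepToString(multiply(a, b)));
    }
}
```

For matrices of size n the innermost statement runs n cubed times, which is why even a small saving per iteration shows up clearly in the total time.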
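The overflow behavior is easy to reproduce. In the hypothetical class below, shortProduct multiplies two int values with 32-bit arithmetic and only then widens the result, while longProduct widens an operand first so that the multiplication itself is done in 64 bits.

```java
// Demonstrates 32-bit wrap-around versus exact 64-bit multiplication.
final class OverflowDemo {

    // 32-bit multiply: the product wraps to int before it is widened to long.
    static long shortProduct(int a, int b) {
        return a * b;
    }

    // 64-bit multiply: one operand is widened first, so the product is exact.
    static long longProduct(int a, int b) {
        return (long) a * b;
    }

    public static void main(String[] args) {
        System.out.println(shortProduct(50_000, 50_000));
        System.out.println(longProduct(50_000, 50_000));
    }
}
```

For 50000 times 50000 the exact product is 2500000000, which does not fit in a signed 32-bit int; the 32-bit version yields -1794967296 instead, a wrong magnitude and a wrong sign at once.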

The test results are as follows:

Matrix size   Test time in milliseconds
              SSE2      GPR      Long Java   Short Java
1500          4850      10100    22000       13000
3000          34400     75700    175200      104700
5000          153500    -        -           -

The column names in the second header row mean:

SSE2 - MLJ with usage of the SSE2 technology extension of the Intel's x86 processors.
GPR - MLJ with usage of general purpose registers only.
Long Java - Java only multiplication with only 64-bit operations.
Short Java - Java only multiplication with 32-bit multiplication and 64-bit addition.

As we can see, the Short Java variant does not perform too badly against the GPR-only alternative: it takes 30-40% longer. It is very probable that most C compilers will hardly do better than the GPR-only variant, even with special hints from developers, as long as inline assembly and intrinsics are excluded. This means that the best possible C program written without inline assembly or intrinsics can be at best about 30-40% better than Short Java, while more typical C programs will perform not much better than the JIT-compiled Java version. However, if we remember that the exactness of the result is very important, then the difference between the Long Java and Short Java versions shows a bottleneck created by the Java Language Specification. It is much worse than the difference between GPR and Short Java: Long Java takes almost 70% longer than Short Java. And compared to SSE2, both pure Java variants are 3 to 5 times slower. Even the GPR-only solution, as an example of a possible excellent C compiler optimization, is a bit more than twice as slow.

In conclusion, Machine Level Java is a very promising technology that demonstrates performance on par with the best available C compilers when they are used with inline assembly and intrinsics.
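The percentages quoted above follow directly from the table. The short sketch below recomputes the ratios from the measured times for matrix sizes 1500 and 3000; the class and method names are hypothetical, and the numbers are copied from the results table.

```java
// Recomputes the timing ratios from the results table (times in milliseconds).
final class TimingRatios {

    // How many times longer the slower variant runs than the faster one.
    static double ratio(long slowerMillis, long fasterMillis) {
        return (double) slowerMillis / fasterMillis;
    }

    public static void main(String[] args) {
        // Columns: SSE2, GPR, Long Java, Short Java.
        long[][] times = {
            {4850, 10100, 22000, 13000},     // matrix size 1500
            {34400, 75700, 175200, 104700},  // matrix size 3000
        };
        for (long[] t : times) {
            System.out.printf(
                "Short/GPR=%.2f Long/Short=%.2f Short/SSE2=%.2f Long/SSE2=%.2f GPR/SSE2=%.2f%n",
                ratio(t[3], t[1]), ratio(t[2], t[3]),
                ratio(t[3], t[0]), ratio(t[2], t[0]), ratio(t[1], t[0]));
        }
    }
}
```

Each output line covers one matrix size, so the trend from 1500 to 3000 can be read by comparing the two lines column by column.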

Download

The Machine Level Java can be downloaded here.

And here it is possible to discuss Machine Level Java or ask a question about it.



Copyright (C) by Alexey Bezrodnov, year 2014