Thursday, September 16, 2010

Bytecode, Opcodes, Dalvik, Java, Virtual Machine(VM)

SkyHi @ Thursday, September 16, 2010
I asked an Android guy to explain some things to me about Dalvik, so I could understand the Oracle v. Google situation in more depth.
I'm not a programmer, I told him, but I need to understand the tech behind the lawsuit, so I can understand when the lawyers start arguing about what Google did and didn't do and whether it was entitled to do it or not. What is Dalvik? Why use it? So I asked Mark Murphy, the founder of CommonsWare and the author of three books on Android application development, including Busy Coder’s Guide to Android Development, to explain it. He also trains folks in developing for Android. Also, he doesn't work for Google, so he can speak more freely. Once you are in litigation, most companies are silent as the grave until it's decided. Don't go by SCO. That big-mouth grandstanding to the media was not typical.
If, like me, you never paid much attention before to all the intricacies of Java, it's a chance to get up to speed on all that, as the article walks us through different ways the word Java is used, sometimes a bit loosely, as well as explaining what Dalvik does and the benefits it offers.

What I understand is that there were important technical reasons why Google would want to roll their own code, involving security, memory consumption, and speed, quite aside from any legal or licensing issues.
This article isn't directly about the legal issues, by the way, or about the question that hovers over this matter in my mind, namely: can anyone write their own Java-like code without getting sued? This is just about the technical side, but as the case goes forward, understanding the technical bits will help us to understand the legal bits.
I've learned a few other things from researching and asking around that helped me to further understand the context for why Google went the way it did. For example, Sun's version of Java for mobile, Java ME, is released under GPLv2 only, but unlike OpenJDK, the open-source version of Java SE, there is no Classpath exception applied to Java ME, so using it might create doubt about whether the system exception applies. Moreover, none of the innovation over the past few years (like JavaFX) is in the open source version. I think then that it would be accurate to conclude that Google wanted something better than what was available.
So, with that introduction, here's Mark's article:
**************************

What is Dalvik?
~ by Mark Murphy
When it comes to the Oracle lawsuit against Google regarding Android, many sites and news outlets say that “Android applications are written in Java”. As usual, this is a bit of shorthand.
To really understand what is going on, and where Oracle’s lawsuit comes into the picture, we need to have a bit more detailed picture of what really goes on when somebody writes an Android application:
1. Developers write Java‐syntax source code, leveraging class libraries published by the Android project, Apache Harmony, and other groups.
2. Developers compile the source code into Java VM bytecode, using the javac compiler that comes with the Java SDK.
3. Developers translate the Java VM bytecode into Dalvik VM bytecode, which is packaged with other files into a ZIP archive with the .apk extension.
4. An Android device or emulator runs the .apk file, causing the bytecode to be executed by an instance of a Dalvik VM.
And for most of you, that description was gibberish. That’s the reason why we use the shorthand “Android applications are written in Java” — spelling out all those details every time would get very tiresome. But, we need to sort out this gibberish to answer questions like:
What is Oracle suing over?
Why does this impact Google, if Java was released under the GPL?
Who else might be at risk due to Oracle’s decision to sue?
This article will try to explain three things, in lay terms:
1. What does all that gibberish mean?
2. What technical reasons are there for all that gibberish, compared to the similar gibberish an ordinary Java developer would use?
3. Where do the Oracle patents and such tie in, generally speaking?
First, a few disclaimers:
  • In the interests of making this stuff make sense to more ordinary people, I will wind up using some shorthand of my own from time to time. Purists will probably come up with any number of places where what I say glosses over some details. I am certainly interested in making updates and corrections as needed, where those will materially help ordinary people understand things better.
  • I will use “Sun” to refer to the firm that invented Java and created the Java development tools. I will use “Oracle” to refer to the firm that acquired Sun and, therefore, owns patents, copyrights, and trademarks relevant to Java.
  • I am an Android advocate, though I do not work for Google. While this article is not strictly intended to steer readers’ opinions one way or another on the merits of Oracle’s lawsuit, I am sure that my biases will leak through.
  • This article is written for people who have a smattering of technical knowledge, enough to, say, have made some sense of what was going on in the various lawsuits that SCO was recently a part of.
  • I have a somewhat quirky sense of humor. You have been warned.
Explaining the Gibberish
Let’s take those four pieces of gibberish and examine them, one scary‐looking phrase at a time.
“Java‐syntax source code”
“Java” itself is a bit of shorthand. There are many things that can legitimately be called “Java”. One of those things is the syntax of the Java programming language.
Software developers write source code, in some programming language. Java offers one such language, but there are a crazy number of other programming languages, from FORTRAN and COBOL of the mid‐20th century to newcomers like Scala and Clojure.
Each programming language has a syntax, just as each human language has its rules of grammar and roster of available words. The Java programming language has a specific syntax.
Most — but not all — Android developers will be creating Android applications by writing Java‐syntax source code, no different than if they were writing Java applets, Java desktop applications, so‐called “Java ME” applications for some mobile phones, or Java‐based Web applications to run on a Web server somewhere.
“Class libraries published by the Android project, Apache Harmony, and other groups”
When you build a bridge, you typically do not start by opening an iron mine. Rather, you build the bridge from a mix of pre‐fabricated and custom parts. Pre‐fabricated parts might include girders and rivets. Somebody else was responsible for creating those girders, somebody else was responsible for mining the iron ore used to create the steel used to create the girders, and so on.
Similarly, in software development, applications are rarely created completely from scratch. Instead, developers take advantage of pre‐fabricated software routines. One term for a collection of those, used in “object oriented” languages like Java, is a “class library”.
I mentioned earlier that there are many things that are called “Java”. Besides the syntax of the source code, some people refer to certain class libraries as being “Java”. Sun developed three major flavors of these class libraries, one for conventional desktop environments (Java SE), one for a limited mobile environment (Java ME), and one for server‐based Web applications (Java EE).
Android has class libraries. Some of those class libraries were written by the core Android team, made up of Google employees and contributors from other firms. The rest of the class libraries come from other open source projects. Notable among these is Apache Harmony, a project aiming to create a complete replacement implementation of all pieces of Java.
Specifically, Harmony offers a class library that is generally compatible with the classes that come from Java itself (the classes have the same names, for example). Android has included some, but not all, of the Harmony classes in the Android OS. Hence, Android developers can write code that uses “Java” classes, even though those classes did not come from Sun and their copyrights are not held by Oracle.
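To make the idea of a class library concrete, here is a minimal sketch. It assumes the library in use supplies classes under the familiar names java.util.List, java.util.ArrayList, and java.util.Collections; the class name ClassLibraryDemo and its method are made up for this illustration. The source code only refers to the classes by name, so it does not care who wrote the implementation behind those names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example class; only the java.util names come from the class library.
final class ClassLibraryDemo {
    /** Returns the given names in alphabetical order, using only library routines. */
    static List<String> sortedCopy(List<String> names) {
        List<String> copy = new ArrayList<>(names); // pre-fabricated list class
        Collections.sort(copy);                     // pre-fabricated sorting routine
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("Harmony");
        names.add("Android");
        names.add("Dalvik");
        System.out.println(sortedCopy(names)); // prints [Android, Dalvik, Harmony]
    }
}
```

The developer here wrote the "custom part" (what to sort); the list and the sorting are the girders and rivets.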
Java VM bytecode
Yet another thing that people sometimes refer to as “Java” is the Java virtual machine, or Java VM for short.
Many programming languages are “compiled”, meaning that a tool converts the source code that developers type in into something that a “machine” can execute directly. You can think of this as akin to converting a singer’s voice into the bits and bytes that go into an MP3 file or onto a CD.
Many compiled programming languages are compiled into “opcodes” that are designed to be run by some specific sort of chip. There are opcode sets for the Intel chip in your notebook, and other opcode sets for the ARM chip in your smartphone, and still other opcode sets for the MIPS chip in your DVD player. If you want your source code to run on all three types of chips, you would need to compile it three times.
Some compiled programming languages, though, target not a real chip, but a fake one — a virtual machine. A virtual machine (VM) is a piece of software that mimics the functionality of a real chip. It runs bytecode (the VM equivalent of opcodes) designed for that specific type of VM. Different versions of the VM software can then be written to run on different types of real chips (Intel, ARM, MIPS, etc.). This way, a compiled VM application can run on a wide range of physical chips, without having to recompile the source code to target each physical chip.
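For readers who want to see the idea in miniature, the following is a toy virtual machine. Everything in it is invented for illustration: the class name ToyVm, the four bytecode values, and the program it runs. It is not how the Java VM or Dalvik is implemented, only a sketch of "software that reads bytecode and carries out each instruction".

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** A toy virtual machine, invented for illustration; it is not the JVM or Dalvik. */
final class ToyVm {
    // Hypothetical bytecode values; real VMs define their own instruction sets.
    static final int HALT = 0; // stop and return the top of the stack
    static final int PUSH = 1; // push the next number in the program onto the stack
    static final int ADD = 2;  // pop two numbers, push their sum
    static final int MUL = 3;  // pop two numbers, push their product

    /** Executes the program one instruction at a time and returns its result. */
    static int run(int[] program) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0; // program counter: index of the next instruction
        while (true) {
            int op = program[pc++];
            switch (op) {
                case PUSH: stack.push(program[pc++]); break;
                case ADD:  stack.push(stack.pop() + stack.pop()); break;
                case MUL:  stack.push(stack.pop() * stack.pop()); break;
                case HALT: return stack.pop();
                default:   throw new IllegalStateException("Unknown bytecode: " + op);
            }
        }
    }

    public static void main(String[] args) {
        // Bytecode for (2 + 3) * 4
        int[] program = {PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT};
        System.out.println(run(program)); // prints 20
    }
}
```

The program array never changes, whatever chip the ToyVm software itself happens to be running on. That is the portability benefit described above.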
Java is perhaps the most famous language that uses a virtual machine — the JVM. It was not the first and is not the only such language. Other popular languages that use VMs include Perl, Python, and Smalltalk — the latter is the language behind the Squeak app that Apple removed from the App Store for violating its “Apple‐only languages” policy. Similarly, Microsoft’s .NET languages (e.g., VB.Net, C#) run on a virtual machine called the Common Language Runtime (CLR).
The javac compiler that comes with the Java SDK
Still another thing that some people refer to as “Java” is the Java software development kit, or Java SDK (or, occasionally, JDK).
The Java SDK represents the set of tools and files needed by a programmer to write Java applications. Among other things, it includes a compiler tool — javac — that converts Java source code into Java bytecode that can be executed by the Java VM.
Dalvik VM bytecode
The Dalvik VM is a virtual machine, along the lines of the Java VM, the Parrot VM (Perl), Microsoft’s CLR, and so forth. Dalvik was written principally for use with Android, though some have experimented with using it separately.
Each VM has its own bytecode, just as each type of CPU chip has its own opcodes. Hence, the Dalvik VM bytecode is not the same as the Java VM bytecode, or the Parrot VM bytecode, etc.
Translate the Java VM bytecode
That being said, Android does come with tools that translate compatible Java VM bytecode into Dalvik VM bytecode. This allows developers to write Java‐syntax source code, compile it with the Java SDK’s compiler, then get Android‐compatible Dalvik VM bytecode in the end.
Note that not all Java VM bytecode is compatible with the translation process, and therefore with Dalvik by extension. Notably, old bytecode (Java 1.4.2 and earlier) and bytecode compiled by non‐Sun Java compilers will fail to translate.
An instance of a Dalvik VM
A Java program is run by a Java virtual machine. The VM reads in the Java bytecode, finds the desired entry point (a main method on a designated class), and executes the bytecode instructions. Similarly, an Android program is run by a Dalvik virtual machine.
If you want to run two separate Java programs at once, you will usually wind up with two copies of the Java virtual machine running on your computer. Similarly, when you run more than one Android application, each application usually gets its own Dalvik VM instance.
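The "entry point" mentioned above looks like this in an ordinary Java program. The class name HelloEntryPoint and its greeting are made up for this sketch, and the file is assumed to be saved as HelloEntryPoint.java. Compiling it with javac produces Java VM bytecode in HelloEntryPoint.class, and running "java HelloEntryPoint" starts a Java VM that looks for the main method and begins executing there.

```java
// Hypothetical example class, assumed to live in a file named HelloEntryPoint.java.
final class HelloEntryPoint {
    /** Ordinary application code, called from the entry point below. */
    static String greeting() {
        return "Hello from the entry point";
    }

    /** The entry point: the VM starts executing the program here. */
    public static void main(String[] args) {
        System.out.println(greeting());
    }
}
```

An Android application is started differently, so this shows only the plain Java side of the comparison.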
So, Why Dalvik?
OK, so, why did anyone bother to create Dalvik in the first place? Why not just use plain ol’ ordinary Java?
I do not claim to know all of the rationale behind the decision. That being said, here are at least some of the known technical reasons:
Memory Consumption
As noted above, if you want to run more than one Java or Android application, each application gets its own virtual machine instance. However, in Java, that will require a substantial amount of RAM, and on Dalvik it does not.
Why? Sharing.
Much of what is in a VM is read‐only. For example, the class libraries each VM uses do not typically get modified when a program using those libraries is run.
In Java, each application gets its own copy of all the read‐only portions of the VM.
In Dalvik, each application shares one master copy of all the read‐only portions of the VM, using techniques like copy‐on‐write.
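Copy‐on‐write can be sketched in a few lines. This is an illustration of the general technique only, not Dalvik's code: in a real system the sharing happens at the level of operating‐system memory pages rather than Java objects, and the names CopyOnWriteSketch and AppView are invented here. Each "application" reads from one shared master copy and only receives a private copy the first time it tries to write.

```java
/** Illustrative sketch of copy-on-write; the class names are hypothetical. */
final class CopyOnWriteSketch {
    /** One application's view of a block of memory that starts out shared. */
    static final class AppView {
        private int[] page;       // initially points at the shared master copy
        private boolean ownsCopy; // true once this app has made a private copy

        AppView(int[] sharedMaster) {
            this.page = sharedMaster;
        }

        int read(int index) {
            return page[index]; // reads never trigger a copy
        }

        void write(int index, int value) {
            if (!ownsCopy) {
                page = page.clone(); // copy only on the first write
                ownsCopy = true;
            }
            page[index] = value;
        }

        boolean isSharing(int[] sharedMaster) {
            return page == sharedMaster;
        }
    }

    public static void main(String[] args) {
        int[] master = {10, 20, 30};
        AppView appA = new AppView(master);
        AppView appB = new AppView(master);

        appA.write(0, 99); // appA now has its own private copy

        System.out.println(appA.read(0));           // prints 99
        System.out.println(appB.read(0));           // prints 10
        System.out.println(appB.isSharing(master)); // prints true
    }
}
```

As long as applications mostly read, as they do with class libraries, nearly everything stays in the single shared copy.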
As a result, Android, through Dalvik, can run more programs in a tightly‐constrained memory environment, like a smartphone.
Security
Saving memory is good. It allows us to do more with less. However, with Dalvik, there is an extra important benefit: it gives us better security than might otherwise be possible.
Linux has a security model baked into the kernel, one involving users and permissions. Each Linux program is run under a certain user’s account, whether that be a real person or a fictitious account for a particular program (e.g., an apache account that runs a Web server). All files in a proper Linux filesystem are owned by some user. Files that are marked as usable only by the user can be read and written to by that user’s own program, but cannot be read or modified by any other program, since other programs run as other users.
With me so far?
In Android, by default, each application gets its own user account — akin to the apache scenario above. When you run an Android application, it can access its own files, but it cannot access other applications’ files by default, courtesy of it running as a certain user.
This is only possible because of the memory sharing described in the previous section. If Android were limited to a traditional Java VM, each program would take up too much memory. Android applications might all have to share a single Java VM and all run as the same user, meaning one application could access another application’s data. You would have to layer all sorts of security cruft into the Java/Android environment to isolate applications from one another.
But, the memory sharing means each Android application can have its own Dalvik VM and run under its own user account. As a result, we get the tried‐and‐true Linux security model, making it significantly less likely that one application will be able to abscond with another application’s data.
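The file permissions that this model relies on can be seen from ordinary Java on a Linux machine. The sketch below is not Android code. It assumes a POSIX‐style filesystem, and the class name OwnerOnlyFile and the "app-private-" file prefix are made up. It creates a temporary file marked as usable only by its owner and reports the permissions the operating system recorded, which on a typical Linux system should be rw-------.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Illustrative sketch; assumes a POSIX filesystem such as the ones Linux uses. */
final class OwnerOnlyFile {
    /** Creates an owner-only temporary file and returns its permission string. */
    static String createAndDescribe() {
        // Read and write for the owning user; nothing for any other user.
        Set<PosixFilePermission> ownerOnly = PosixFilePermissions.fromString("rw-------");
        try {
            Path file = Files.createTempFile("app-private-", ".dat",
                    PosixFilePermissions.asFileAttribute(ownerOnly));
            try {
                return PosixFilePermissions.toString(Files.getPosixFilePermissions(file));
            } finally {
                Files.delete(file); // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(createAndDescribe());
    }
}
```

A program running under a different user account would be refused by the kernel if it tried to open such a file, which is the isolation described above.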
Register‐Based VM
There are two ways to implement a virtual machine, “stack‐based” and “register‐based”. Java VMs tend to be stack‐based. The Dalvik VM is register‐based. This too is an optimization designed for mobile environments, where RAM is limited, as you can get more stuff done in fewer bytes, on average, with a register‐based architecture.
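The difference between the two styles can be sketched with the statement a = b + c. Both "machines" below are invented for illustration, as is the class name StackVsRegister; neither is the real Java VM or Dalvik instruction set. The sketch only counts instructions and says nothing about how many bytes each real instruction occupies.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative comparison of two toy instruction styles; not real VM code. */
final class StackVsRegister {
    /** Stack style: operands are moved onto and off an operand stack. */
    static int addOnStackMachine(int[] locals) {
        Deque<Integer> stack = new ArrayDeque<>();
        int instructions = 0;
        stack.push(locals[1]);                 instructions++; // LOAD b
        stack.push(locals[2]);                 instructions++; // LOAD c
        stack.push(stack.pop() + stack.pop()); instructions++; // ADD
        locals[0] = stack.pop();               instructions++; // STORE a
        return instructions;
    }

    /** Register style: one instruction names its destination and both sources. */
    static int addOnRegisterMachine(int[] registers) {
        int instructions = 0;
        registers[0] = registers[1] + registers[2]; instructions++; // ADD r0, r1, r2
        return instructions;
    }

    public static void main(String[] args) {
        int[] locals = {0, 2, 3};    // a, b, c
        int[] registers = {0, 2, 3}; // r0, r1, r2
        System.out.println(addOnStackMachine(locals));       // prints 4
        System.out.println(addOnRegisterMachine(registers)); // prints 1
        System.out.println(locals[0] + " " + registers[0]);  // prints 5 5
    }
}
```

Both versions compute the same answer; the register style simply does it in fewer steps for the VM to process.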

REFERENCES

http://www.groklaw.net/article.php?story=20100915143729255