Thursday, June 9, 2011

How to improve JPA performance by 1,825%

The Java Persistence API (JPA) provides a rich persistence architecture. JPA hides much of the low-level drudgery of database access, freeing the application developer from worrying about the database, and allowing them to concentrate on developing the application. However, this abstraction can lead to poor performance if the application programmer does not consider how their implementation affects database usage.

JPA provides several optimization features and techniques, and some pitfalls waiting to snag the unwary developer. Most JPA providers also provide a plethora of additional optimization features and options. In this blog entry I will explore the various optimizations options and techniques, and a few of the common pitfalls.

The application is a simulated database migration from a MySQL database to an Oracle database. Perhaps there are more optimal ways to migrate a database, but it is surprising how good JPA's performance can be, even when processing hundreds of thousands, or even millions, of records. Perhaps it is not a straightforward migration, or the application's business logic is required, or perhaps the application has already been persisted through JPA, so using JPA to migrate the database is just easiest. Regardless, this fictitious use case is a useful demonstration of how to achieve good performance with JPA.

The application consists of an Order processing database. The model contains a Customer, Order and OrderLine. The application reads all of the Orders from one database, and persists them to the second database. The source code for the example can be found here.

The initial code for the migration is pretty simple:

EntityManagerFactory emf = Persistence.createEntityManagerFactory("order");
EntityManager em = emf.createEntityManager();
EntityManagerFactory emfOld = Persistence.createEntityManagerFactory("order-old");
EntityManager emOld = emfOld.createEntityManager();
Query query = emOld.createQuery("Select o from Order o");
List orders = query.getResultList();
em.getTransaction().begin();
// Reset old Ids, so they are assigned from the new database.
for (Order order : orders) {
    order.setId(0);
    order.getCustomer().setId(0);
}
for (Order order : orders) {
    em.persist(order);
    for (OrderLine orderLine : order.getOrderLines()) {
        em.persist(orderLine);
    }
}
em.getTransaction().commit();
em.close();
emOld.close();
emf.close();     
emfOld.close();

The example test runs this migration using 3 variables for the number of Customers, Orders per Customer, and OrderLines per Order. So, 1000 customers, each with 10 orders, and each with 10 order lines, would be 111,000 objects.

The test was run on a virtualized 64 bit Oracle Sun server with 4 virtual cores and 8 gigs of RAM. The databases run on similar machines. The test is single threaded, running in Oracle Sun JDK 1.6. The tests are run using EclipseLink JPA 2.3, and migrating from a MySQL database to an Oracle database.

This code functions fine for a small database migration, but as the database size grows, some issues become apparent. It actually handles 100,000 objects surprisingly well, taking about 2 minutes. This is surprising, given it is thoroughly unoptimized and persists all 100,000 objects in a single persistence context and transaction.

Optimization #1 - Agent

EclipseLink implements LAZY for OneToOne and ManyToOne relationships using byte code weaving. EclipseLink also uses weaving to perform many other optimizations, such as change tracking and fetch groups. The JPA specification provides the hooks for weaving in EJB 3 compliant application servers, but in Java SE or other application servers weaving is not performed by default. To enable EclipseLink weaving in Java SE for this example the EclipseLink agent is used. This is done using the Java -javaagent:eclipselink.jar option. If dynamic weaving is unavailable in your environment, another option is to use static weaving, for which EclipseLink provides an ant task and command line utility.
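For example, assuming the application's main class is Migration and the jars are in a lib directory (both placeholders, not part of the original example; the exact jar names depend on your EclipseLink version):

```shell
# Dynamic weaving: attach the EclipseLink agent at JVM startup.
java -javaagent:lib/eclipselink.jar -cp classes:lib/eclipselink.jar Migration

# Static weaving alternative: weave the entity classes at build time,
# then run against the woven classes with no agent required.
java -cp lib/eclipselink.jar:lib/javax.persistence.jar \
    org.eclipse.persistence.tools.weaving.jpa.StaticWeave classes woven-classes
```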

Optimization #2 - Pagination

In theory, at some point you should run out of memory by bringing the entire database into memory in a single persistence context. So next I increased the size to 1 million objects, and this gave the expected out of memory error. Interestingly, this was with a heap size of only 512 megs. If I had used the entire 8 gigs of RAM, I could, in theory, have persisted around 16 million objects in a single persistence context. If I gave the virtualized machine the full 98 gigs of RAM available on the server, perhaps it would even be possible to persist 100 million objects. Perhaps we are past the day when pulling an entire database into RAM made no sense, and this is no longer such a crazy thing to do. But, for now, let's assume it is an idiotic thing to do, so how can we avoid this?

JPA provides a pagination feature that allows a subset of a query to be read. This is supported in JPA through the Query setFirstResult and setMaxResults API. So instead of reading the entire database in one query, the objects will be read page by page, and each page will be persisted in its own persistence context and transaction. This avoids ever having to read the entire database, and also should, in theory, make the persistence context more optimized by reducing the number of objects it needs to process together.

Switching to using pagination is relatively easy to do for the original orders query, but some issues crop up with the relationship to Customer. Since orders can share the same customer, it is important that each order does not insert a new customer, but uses the existing customer. If the customer for the order was already persisted on a previous page, then the existing one must be used. This requires the usage of a query to find the matching customer in the new database, which introduces some performance issues we will discuss later.

The updated code for the migration using pagination is:

EntityManagerFactory emf = Persistence.createEntityManagerFactory("order");
EntityManagerFactory emfOld = Persistence.createEntityManagerFactory("order-old");
EntityManager emOld = emfOld.createEntityManager();
Query query = emOld.createQuery("Select o from Order o order by o.id");
int pageSize = 500;
int firstResult = 0;
query.setFirstResult(firstResult);
query.setMaxResults(pageSize);
List orders = query.getResultList();
boolean done = false;
while (!done) {
    if (orders.size() < pageSize) {
        done = true;
    }
    EntityManager em = emf.createEntityManager();
    em.getTransaction().begin();
    Query customerQuery = em.createNamedQuery("findCustomByName");
    // Reset old Ids, so they are assigned from the new database.
    for (Order order : orders) {
        order.setId(0);
        customerQuery.setParameter("name", order.getCustomer().getName());
        try {
            Customer customer = (Customer)customerQuery.getSingleResult();
            order.setCustomer(customer);
        } catch (NoResultException notPersistedYet) {
            // Customer does not yet exist, so null out id to have it persisted.
            order.getCustomer().setId(0);
        }
    }
    for (Order order : orders) {
        em.persist(order);
        for (OrderLine orderLine : order.getOrderLines()) {
            em.persist(orderLine);
        }
    }
    em.getTransaction().commit();
    em.close();
    firstResult = firstResult + pageSize;
    query.setFirstResult(firstResult);
    if (!done) {
        orders = query.getResultList();
    }
}
emOld.close();
emf.close();     
emfOld.close();

Optimization #3 - Query Cache

This will introduce a lot of queries for customer by name (10,000 to be exact), one for each order. This is not very efficient, and can be improved through caching. In EclipseLink there is both an object cache and a query cache. The object cache is enabled by default, but objects are only cached by Id, so this does not help us with the query on the customer's name. So, we can enable a query cache for this query. A query cache is specific to the query, and caches the query results keyed on the query name and its parameters. A query cache is enabled in EclipseLink using the query hint "eclipselink.query-results-cache"="true". This should be set where the query is defined, in this case in the orm.xml. This will reduce the number of queries for customer to 1,000, which is much better.
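For reference, the same named query and hint could also be declared with annotations instead of orm.xml; a sketch assuming the example's Customer entity (the id and name fields are abbreviated here):

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.NamedQuery;
import javax.persistence.QueryHint;

// Annotation equivalent of the orm.xml named query, with the
// EclipseLink query results cache enabled through a query hint.
@Entity
@NamedQuery(
    name = "findCustomByName",
    query = "Select c from Customer c where c.name = :name",
    hints = @QueryHint(name = "eclipselink.query-results-cache", value = "true"))
public class Customer {
    @Id
    private long id;
    private String name;
}
```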

There are alternatives to using the query cache. EclipseLink also supports in-memory querying. In-memory querying means evaluating the query against all of the objects in the object cache, instead of accessing the database. In-memory querying is enabled through the query hint "eclipselink.cache-usage"="CheckCacheOnly". If you enabled a full cache on Customer, then as you persisted the orders all of the existing customers would be in the cache, and you would never need to access the database. Another manual solution is to maintain a Map in the migration code, keying the new customers by name. For all of the above solutions, if the cache is made fixed-sized (the query cache defaults to a size of 100), you would never need all of the customers in memory at the same time, so there would be no memory issues.
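A fixed-size manual Map can be sketched with a LinkedHashMap in access order; the class name and maximum size of 100 are arbitrary choices, and the values would be the newly persisted Customer objects:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A bounded LRU cache keyed by customer name: once MAX_SIZE entries are
// cached, the least-recently-accessed entry is evicted on each insert.
public class CustomerCache<V> extends LinkedHashMap<String, V> {
    private static final int MAX_SIZE = 100;

    public CustomerCache() {
        super(16, 0.75f, true); // access-order, for LRU eviction
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > MAX_SIZE;
    }
}
```

Because the map never exceeds 100 entries, all of the customers are never in memory at once, matching the fixed-size behavior described above.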

Optimization #4 - Batch Fetch

The most common performance issue in JPA is in the fetching of relationships. If you query n orders, and access their order-lines, you get n queries for order-line. This can be optimized through join fetching and batch fetching. Join fetching joins the relationship in the original query and selects from both tables. Batch fetching executes a second query for the related objects, but fetches them all at once, instead of one by one. Because we are using pagination, this makes optimizing the fetch a little more tricky. Join fetching will still work, but since order-lines is join fetched, and there are 10 order-lines per order, the page size that was 500 orders is now only 50 orders (and their 500 order-lines). We can resolve this by increasing the page size to 5000, but given that in a real application the number of order-lines is not fixed, this becomes a bit of a guess. But the page size was just a heuristic number anyway, so no real issue. Another issue with join fetching with pagination is that the first and last objects may not have all of their related objects, if they fall in-between a page. Fortunately EclipseLink is smart enough to handle this, but it does require 2 extra queries for the first and last order of each page. Join fetching also has the drawback that it selects more data when a OneToMany is join fetched. Join fetching is enabled in JPQL using join fetch o.orderLines.

Batch fetching normally works by joining the original query with the relationship query, but because the original query used pagination, this will not work. EclipseLink supports three types of batch fetching, JOIN, EXISTS, and IN. IN works with pagination, so we can use IN batch fetching. Batch fetch is enabled through the query hint "eclipselink.batch"="o.orderLines", and "eclipselink.batch.type"="IN". This will reduce the n queries for order-line to 1. So for each batch/page of 500 orders, there will be 1 query for the page of orders, and 1 query for the order-lines, and 50 queries for customer.

Optimization #5 - Read Only

The application is migrating from the MySQL database to the Oracle database, so it is only reading from MySQL. When you execute a query in JPA, all of the resulting objects become managed as part of the current persistence context. This is wasteful, as managed objects are tracked for changes and registered with the persistence context. EclipseLink provides a "eclipselink.read-only"="true" query hint that allows the persistence context to be bypassed. This can be used for the migration, as the objects read from MySQL will never be written back to MySQL.

Optimization #6 - Sequence Pre-allocation

We have optimized the first part of the application, reading from the MySQL database. The second part is to optimize the writing to Oracle.

The biggest issue with the writing process is that the Id generation is using an allocation size of 1. This means that for every insert there will be an update and a select for the next sequence number. This is a major issue, as it is effectively doubling the amount of database access. By default JPA uses a pre-allocation size of 50 for TABLE and SEQUENCE Id generation, and 1 for IDENTITY Id generation (a very good reason to never use IDENTITY Id generation). But frequently applications are unnecessarily paranoid of holes in their Id values and set the pre-allocation value to 1. By changing the pre-allocation size from 1 to 500, we reduce database access by about 1,000 per page (500 updates and 500 selects).
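For reference, the pre-allocation size is the generator's allocationSize in annotations; a sketch mirroring the ORD_SEQ generator from the example's orm.xml:

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.TableGenerator;

@Entity
public class Order {
    @Id
    // Pre-allocate 500 ids per sequence-table access, instead of 1.
    @TableGenerator(name = "ORD_SEQ", allocationSize = 500)
    @GeneratedValue(strategy = GenerationType.TABLE, generator = "ORD_SEQ")
    @Column(name = "ORDER_ID")
    private long id;
}
```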

Optimization #7 - Cascade Persist

I must admit I intentionally added the next issue to the original code. Notice that in the for loop persisting the orders, I also loop over the order-lines and persist them. This would be required if the order did not cascade the persist operation to order-line. However, I also made the orderLines relationship cascade, as well as order-line's order relationship. The JPA spec defines somewhat unusual semantics for its persist operation, requiring that the cascade persist be called every time persist is called, even if the object is an existing object. This makes cascading persist a potentially dangerous thing to do, as it could trigger a traversal of your entire object model on every persist call. This is an important point, and I added this issue purposefully to highlight it, as it is a common mistake made in JPA applications. The cascade persist causes each persist call on an order-line to persist its order, and every order-line of the order again. This results in an n^2 number of persist calls. Fortunately there are only 10 order-lines per order, so this only results in 100 extra persist calls per order. It could have been much worse: if the customer defined a relationship back to its orders, you would have 1,000 extra calls per order. The persist does not need to do anything, as the objects are already persisted, but the traversal can be expensive. So, in JPA you should either mark your relationships cascade persist, or call persist in your code, but not both. In general I would recommend only cascading persist for logically dependent relationships (i.e. things that would also cascade remove).
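To make the traversal cost concrete, here is a small self-contained sketch (plain Java, no JPA) that models cascade-persist as a graph traversal; the Node class and counts are purely illustrative, not EclipseLink internals:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of JPA cascade-persist semantics: every persist call
// re-traverses the object graph along cascade relationships.
public class CascadeDemo {
    static class Node {
        final List<Node> cascades = new ArrayList<>();
    }

    // Number of objects visited by a single persist() call.
    static int traversalCost(Node root) {
        Set<Node> seen = new HashSet<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        int visits = 0;
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (!seen.add(n)) {
                continue;
            }
            visits++;
            for (Node c : n.cascades) {
                stack.push(c);
            }
        }
        return visits;
    }

    public static void main(String[] args) {
        // One order with 10 lines; both sides cascade persist.
        Node order = new Node();
        List<Node> lines = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            Node line = new Node();
            order.cascades.add(line); // order cascades to its lines
            line.cascades.add(order); // line cascades back to its order
            lines.add(line);
        }
        // Cascade alone: one traversal of the 11-object graph.
        int cascadeOnly = traversalCost(order);
        // Also calling persist on every line re-traverses the graph each time.
        int redundant = cascadeOnly;
        for (Node line : lines) {
            redundant += traversalCost(line);
        }
        System.out.println(cascadeOnly + " objects visited vs " + redundant);
    }
}
```

In this model the cascade-only path visits 11 objects, while the redundant persist loop visits 121, which is exactly the n^2 blow-up described above.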

Optimization #8 - Batch Writing

Many databases provide an optimization that allows a batch of write operations to be performed as a single database access. There is both parametrized and dynamic batch writing. For parametrized batch writing a single parametrized SQL statement can be executed with a batch of parameter values instead of a single set of parameter values. This is very optimal, as the SQL only needs to be executed once, and all of the data can be passed optimally to the database.

Dynamic batch writing requires dynamic (non-parametrized) SQL that is batched into a single big statement and sent to the database all at once. The database then needs to process this huge string and execute each statement. This requires the database to do a lot of work parsing the statement, so it is not always optimal. It does reduce the number of database accesses, so if the database is remote or poorly connected to the application, this can result in an improvement.

In general parametrized batch writing is much more optimal, and on Oracle it provides a huge benefit, whereas dynamic batch writing does not. JDBC defines the API for batch writing, but not all JDBC drivers support it; some support the API but then execute the statements one by one, so it is important to test that your database supports the optimization before using it. In EclipseLink, batch writing is enabled using the persistence unit property "eclipselink.jdbc.batch-writing"="JDBC".

Another important aspect of using batch writing is that you must have the same SQL (DML actually) statement being executed in a grouped fashion in a single transaction. Some JPA providers do not order their DML, so you can end up ping-ponging between two statements, such as the order insert and the order-line insert, making batch writing ineffective. Fortunately EclipseLink orders and groups its DML, so usage of batch writing reduces the database access from 500 order inserts and 5,000 order-line inserts to 55 (the default batch size is 100). We could increase the batch size using "eclipselink.jdbc.batch-writing.size"; increasing the batch size to 1000 reduces the database accesses to 6 per page.

Optimization #9 - Statement Caching

Every time you execute an SQL statement, the database must parse that statement and execute it. Most of the time an application executes the same set of SQL statements over and over. By using parametrized SQL and caching the prepared statement, you can avoid the cost of having the database parse the statement.

There are two levels of statement caching: one done on the database, and one done on the JDBC client. Most databases maintain a parse cache automatically, so you only need to use parametrized SQL to make use of it. Caching the statement on the JDBC client normally provides the bigger benefit, but requires some work. If your JPA provider is providing you with your JDBC connections, then it is responsible for statement caching. If you are using a DataSource, such as in an application server, then the DataSource is responsible for statement caching, and you must enable it in your DataSource configuration. In EclipseLink, when using EclipseLink's connection pooling, you can enable statement caching using the persistence unit property "eclipselink.jdbc.cache-statements"="true". EclipseLink uses parametrized SQL by default, so this does not need to be configured.

Optimization #10 - Disabling Caching

By default EclipseLink maintains a shared 2nd level object cache. This is normally a good thing, and improves read performance significantly. However, in our application we are only inserting into Oracle, and never reading, so there is no point in maintaining a shared cache. We can disable this using the EclipseLink persistence unit property "eclipselink.cache.shared.default"="false". However, we are reading customers, so we can enable caching for Customer using "eclipselink.cache.shared.Customer"="true".

Optimization #11 - Other Optimizations

EclipseLink provides several other more specific optimizations. I would not really recommend all of these in general as they are fairly minor, and have certain caveats, but they are useful in use cases such as migration where the process is well defined.

These include the following persistence unit properties:
  • "eclipselink.persistence-context.flush-mode"="commit" - Avoids the cost of flushing on every query execution.
  • "eclipselink.persistence-context.close-on-commit"="true" - Avoids the cost of resuming the persistence context after the commit.
  • "eclipselink.persistence-context.persist-on-commit"="false" - Avoids the cost of traversing and persisting all objects on commit.
  • "eclipselink.logging.level"="off" - Avoids some logging overhead.
The fully optimized code:
EntityManagerFactory emf = Persistence.createEntityManagerFactory("order-opt");
EntityManagerFactory emfOld = Persistence.createEntityManagerFactory("order-old");
EntityManager emOld = emfOld.createEntityManager();
System.out.println("Migrating database.");
Query query = emOld.createQuery("Select o from Order o order by o.id");
// Optimization #2 - batch fetch
// #2 - a - join fetch:
//Query query = emOld.createQuery("Select o from Order o join fetch o.orderLines");
// #2 - b - batch fetch (batch fetch is more optimal, as it avoids duplicating the Order data):
query.setHint("eclipselink.batch", "o.orderLines");
query.setHint("eclipselink.batch.type", "IN");
// Optimization #3 - read-only
query.setHint("eclipselink.read-only", "true");
// Optimization #4 - pagination
int pageSize = 500;
int firstResult = 0;
query.setFirstResult(firstResult);
query.setMaxResults(pageSize); 
List orders = query.getResultList();
boolean done = false;
while (!done) {
    if (orders.size() < pageSize) {
        done = true;
    }
    EntityManager em = emf.createEntityManager();
    em.getTransaction().begin();
    Query customerQuery = em.createNamedQuery("findCustomByName");
    // Reset old Ids, so they are assigned from the new database.
    for (Order order : orders) {
        order.setId(0);
        customerQuery.setParameter("name", order.getCustomer().getName());
        try {
            Customer customer = (Customer)customerQuery.getSingleResult();
            order.setCustomer(customer);
        } catch (NoResultException notPersistedYet) {
            // Customer does not yet exist, so null out id to have it persisted.
            order.getCustomer().setId(0);
        }
    }
    for (Order order : orders) {
        em.persist(order);
        // Optimization #5 - avoid n^2 persist calls
        //for (OrderLine orderLine : order.getOrderLines()) {
        //    em.persist(orderLine);
        //}
    }
    em.getTransaction().commit();
    em.close();
    firstResult = firstResult + pageSize;
    query.setFirstResult(firstResult);
    if (!done) {
        orders = query.getResultList();
    }
}
emOld.close();
emf.close();     
emfOld.close();
The optimized persistence.xml:
<persistence-unit name="order-opt" transaction-type="RESOURCE_LOCAL">
    <!--  Optimization #7, 8 - sequence preallocation, query result cache -->
    <mapping-file>META-INF/order-orm.xml</mapping-file>
    <class>model.Order</class>
    <class>model.OrderLine</class>
    <class>model.Customer</class>
    <properties>
        <!-- Change this to access your own database. -->
        <property name="javax.persistence.jdbc.driver" value="oracle.jdbc.OracleDriver" />
        <property name="javax.persistence.jdbc.url" value="jdbc:oracle:thin:@ottvm028.ca.oracle.com:1521:TOPLINK" />
        <property name="javax.persistence.jdbc.user" value="jsutherl" />
        <property name="javax.persistence.jdbc.password" value="password" />
        <property name="eclipselink.ddl-generation" value="create-tables" />
        <!--  Optimization #9 - statement caching -->
        <property name="eclipselink.jdbc.cache-statements" value="true" />
        <!--  Optimization #10 - batch writing -->
        <property name="eclipselink.jdbc.batch-writing" value="JDBC" />
        <property name="eclipselink.jdbc.batch-writing.size" value="1000" />
        <!--  Optimization #11 - disable caching for batch insert (caching only improves reads, so only adds overhead for inserts) -->
        <property name="eclipselink.cache.shared.default" value="false" />
        <!--  Except for Customer which is shared by orders -->
        <property name="eclipselink.cache.shared.Customer" value="true" />
        <!--  Optimization #12 - turn logging off -->
        <!-- property name="eclipselink.logging.level" value="FINE" /-->
        <property name="eclipselink.logging.level" value="off" />
        <!--  Optimization #13 - close EntityManager on commit, to avoid cost of resume -->
        <property name="eclipselink.persistence-context.close-on-commit" value="true" />
        <!--  Optimization #14 - avoid auto flush cost on query execution -->
        <property name="eclipselink.persistence-context.flush-mode" value="commit" />
        <!--  Optimization #15 - avoid cost of persist on commit -->
        <property name="eclipselink.persistence-context.persist-on-commit" value="false" />
    </properties>
</persistence-unit>
The optimized orm.xml:
<?xml version="1.0" encoding="UTF-8"?>
<entity-mappings version="2.1"
    xmlns="http://www.eclipse.org/eclipselink/xsds/persistence/orm"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <named-query name="findCustomByName">
        <query>Select c from Customer c where c.name = :name</query>
        <hint name="eclipselink.query-results-cache" value="true"/>
    </named-query>
    <entity class="model.Order">
        <table-generator name="ORD_SEQ" allocation-size="500"/>
    </entity>
    <entity class="model.Customer">
        <table-generator name="CUST_SEQ" allocation-size="500"/>
    </entity>

</entity-mappings>
So, what is the result? The original un-optimized code took on average 133,496 milliseconds (~2 minutes) to process ~100,000 objects. The fully optimized code took only 6,933 milliseconds (~7 seconds). This is very good, and means it could process 1 million objects in about 1 minute. The optimized code is a 1,825% improvement on the original code.

But, how much did each optimization affect this final result? To answer this question I ran the test 3 times with the fully optimized version, but with each optimization removed in turn. This worked out better than starting with the unoptimized version and adding each optimization separately, as some optimizations are masked by the lack of others. So, in the table below, the bigger the % difference, the better the (removed) optimization was.

Optimization                      Average Result (ms)   % Difference
None                              133,496               1,825%
All                               6,933                 0%
1 - no agent                      7,906                 14%
2 - no pagination                 8,679                 25%
3 - no read-only                  8,323                 20%
4a - join fetch                   11,836                71%
4b - no batch fetch               17,344                150%
5 - no sequence pre-allocation    30,396                338%
6 - no persist loop               7,947                 14%
7 - no batch writing              75,751                992%
8 - no statement cache            7,233                 4%
9 - with cache                    7,925                 14%
10 - other                        7,332                 6%

This shows that batch writing was the best optimization, followed by sequence pre-allocation, then batch fetching.

Wednesday, May 4, 2011

Data Partitioning - Scaling the Database

In Enterprise Java most of the effort is normally done to scale the mid-tier application and its server.  Pretty much every Java application server supports clustering and scaling out the application to several mid-tier machines.  Even if the application server does not officially support clustering, it is normally pretty easy to have a "cluster" of application servers fronted by a round-robin load-balancer.  This would even work with something as simple as Tomcat.

The mid-tier normally scales very well to a cluster, as it does not have any shared in-memory data, as this data is normally stored in a database.  All of the mid-tier cluster members access the same database, and life is good.  The application can have unlimited performance, simply by adding more mid-tier machines.  But what happens when the poor database machine suddenly can't take any more requests?

The most common solution to scaling the database seems to be to buy a bigger and badder database machine.  If 8 cores is not cutting it, then perhaps 16 cores will.  This solution in general works pretty well, assuming hardware vendors can keep stuffing more cores into their machines.  This solution seems to be used in Enterprise Java performance benchmarks such as SpecJ; if you look at the nodes column of the SpecJ 2010 results, all the results have a single database node, some with as many as 40 cores, even though some have 8 mid-tier nodes.

http://www.spec.org/jEnterprise2010/results/jEnterprise2010.html


But what happens when you can't stuff any more cores into a machine, or the cost of an insanely multi-core machine greatly outweighs the cost of multiple lower end hardware machines?  Perhaps this is just a hardware problem, and you just need to wait for the hardware vendors to make a bigger and badder machine, but there are other solutions from the wonderful world of software.

The first solution to look into is to optimize your own application (as always).  Generally the application's code is not so efficient in its database access, and by optimizing the number and types of queries hitting the database, using parametrized SQL, using batch writing, using lazy, join and batch fetching, a significant load can be removed from the database.  But perhaps you already did that, or don't have the expertise, or just don't feel like it.

The second solution is to optimize your database.  By ensuring your database is configured optimally, has the correct indexes, queries are using the optimal query plan, and disk access is optimal, its performance, and thus its scalability, can be improved.  But perhaps you already did that, or don't have the expertise, or just don't feel like it.

The third solution is to investigate caching in the mid-tier.  By caching objects and data in the mid-tier, you can offload a lot of the queries hitting the database, and improve your application's performance to boot.  Most JPA providers support caching, and some such as EclipseLink offer quite advanced caching functionality including invalidation, and coordinated clustered caches.  JPA 2.0 defines some basic caching annotations to enable and access the cache.

Caching mainly benefits reads, but some caching solutions such as Oracle Coherence offer the ability to offload writes as well.  Oracle TopLink Grid provides JPA support for Coherence.

Caching can be a good solution, but there can be issues with stale data, clustering, and mid-tier contention.  Some caching solutions are very good, but not always as good at managing concurrent access to data as relational databases, which have been doing it for decades.  Also, if your database has become a bottleneck because of writes, then caching reads may not be a solution.

The best solution is to scale the database through clustering the database across multiple machines.  This could be a real clustered database, such as Oracle RAC, or just multiple regular database instances.  Clustered databases are good, and can improve your scalability without much work, but depending on your application you may also have to partition your data across the database nodes for optimal scalability.  Without partitioning, if you write a row on one node, then access it on another, the second node must request the latest copy of the data from the first, which can potentially make performance worse.

Partitioning splits your data across each of the database nodes.  There is horizontal partitioning, and vertical partitioning.  Vertical partitioning is normally the easiest to implement.  You can just put half of your classes on one database, and the other half on another.  Ideally the two sets would be isolated from each other and not have any cross database relationships.

For horizontal partitioning you need to split your data across multiple database nodes.  Each database node will have the same tables, but each node's table will only store part of the data.  You can partition the data by its values, using range partitioning, value partitioning, hash partitioning, or even round robin.
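As a rough, purely illustrative sketch of how hash partitioning routes rows (this is not EclipseLink's actual implementation), a partition field value can be hashed to pick a node:

```java
// Toy hash-partition router: maps a partition field value to one of
// N database nodes. Rows with the same field value always land on
// the same node, so related rows partitioned on the same field
// (e.g. Order and OrderLine on ORDER_ID) end up together.
public class HashPartitioner {
    private final int nodes;

    public HashPartitioner(int nodes) {
        this.nodes = nodes;
    }

    public int nodeFor(long partitionFieldValue) {
        // floorMod keeps the result non-negative for negative hashes.
        return Math.floorMod(Long.hashCode(partitionFieldValue), nodes);
    }
}
```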

To enable data partitioning you require your persistence solution to be aware of how to partition the data.  EclipseLink 2.2 added support for partitioning data.  Both vertical and horizontal partitioning are supported.  Several partitioning options are provided at the Session, Entity and Query level:
  • Range partitioning - each partition maps a range of field values.
  • Value partitioning - each field value maps to a partition.
  • Hash partitioning - the field value is hashed to determine its partition.
  • Pinned partitioning - allows an Entity or query to be vertically partitioned.
  • Round robin - allows load balancing of requests across multiple database nodes.
  • Replication - allows data to be replicated across multiple database nodes.
EclipseLink supports partitioning across any database, including both clustered databases such as Oracle RAC, and  standard databases such as MySQL. Relationships and queries across database partitions are supported, but joins across partitions are only supported for clustered databases.

So how does data partitioning with JPA and EclipseLink scale?  To determine the answer, I developed a simple order processing example.  The example defines an Order, OrderLine and Customer.  The example client processes orders for a Customer using 16 client threads.  The application is primarily insert oriented, so it heavily uses the database.  I first ran the application without partitioning on a single MySQL database instance.  To give the poor database no chance of keeping up with the mid-tier client, I ran the mid-tier on a virtualized 8 core machine with 16 gigs of RAM (Oracle Sun hardware, Oracle Linux OS).  I ran the MySQL database on a similar machine, but only gave it 1 virtual core and 8 gigs of RAM.  So, I was pretty sure the application would be bottlenecked on the database.  This was the goal, to simulate a cluster of mid-tier machines accessing a single database machine.

Next, I enabled partitioning of the Order and OrderLine by ORDER_ID using hash partitioning across two database nodes.  I also hash partitioned Customer by its ID.  This resulted in about half of the transactions going to one database, and half to the other.  Because the Order and the OrderLine shared the same ORDER_ID, they were partitioned to the same database node, so I did not need to worry about transactions spanning multiple nodes.  The read for the Customer could go to either node, but because it was a non-transactional read, it was simply routed separately by EclipseLink, which supports using different connections for non-transactional reads versus transactional writes.  Having writes span multiple nodes in a single transaction is normally not desirable.  EclipseLink allows this, and can be integrated with JTA to give 2-phase commit across the nodes.  If JTA is not used EclipseLink still does a 2-stage commit, but there is no guarantee that all of the writes succeed if one node's commit fails after another's has succeeded.

The resulting Order was mapped as:
@Entity
@Table(name="PART_ORDER")
@HashPartitioning(
    name="HashPartitionByOrderId",
    partitionColumn=@Column(name="ORDER_ID"),
    connectionPools={"default","node2"})
@Partitioned("HashPartitionByOrderId")
public class Order implements Serializable {
    @Id
    @GeneratedValue(strategy=GenerationType.TABLE)
    @Column(name="ORDER_ID")
    private long id;
    @Basic
    private String description;
    @Basic
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    @OneToMany(mappedBy="order", cascade=CascadeType.ALL, orphanRemoval=true)
    @OrderBy("lineNumber")
    private List<OrderLine> orderLines = new ArrayList<OrderLine>();
    @ManyToOne(fetch=FetchType.LAZY)
    private Customer customer;
}

For the full source code for the example see here.

For the second run a second MySQL database was added, running on a separate 1 core machine.  The result showed a 66% increase in scalability for the application, processing close to 2x as many orders. The test application was run for 1 minute and the total number of processed orders for all 16 client threads was totaled. This was run 5 times and the results averaged for each configuration.

The results:
Configuration | Threads | Average processed orders | %STD | %DIF
Single database | 16 | 11,150 | 0.4% | 0%
2 node partition | 16 | 18,583 | 2.2% | 66%

The results show that through effective partitioning the database can be scaled out to multiple machines as well as the mid-tier.

Monday, March 7, 2011

JVM Performance - Part III - Concurrent Maps

Concurrent Maps

The main difference between Hashtable and HashMap is that Hashtable is synchronized. For this reason Hashtable is still used in a lot of concurrent code because it is, in theory, thread safe.  In practice, however, this often remains just a theory: if you don't write your concurrent code correctly, it will not be thread safe no matter how many synchronized methods you have.

For example, you could call get() on the Hashtable, then call put() if the value is not there.  Both operations are synchronized and thread safe, but in between your get() and put() another thread could have done the same thing and already put something there, in which case your code may be incorrect and have thrown away some other thread's data.  With a Hashtable the solution to this is to synchronize the whole operation on the map.

// Hashtable's get() is itself synchronized, but the check-then-put
// sequence must be guarded as a whole.
Object value = map.get(key);
if (value == null) {
    synchronized (map) {
        value = map.get(key); // re-check while holding the lock
        if (value == null) {
            value = buildValue(key);
            map.put(key, value);
        }
    }
}

JDK 1.5 added the ConcurrentMap interface and its ConcurrentHashMap implementation, which is thread safe, and designed and optimized for concurrent access. It basically has pages (segments) inside the map, to avoid locking the whole map on concurrent access to different pages. It also provides useful API such as putIfAbsent() to put something in the map unless it is already there, in a thread safe manner.  Using putIfAbsent() is better than a synchronized get() and put() in both concurrency and performance.

Object value = map.get(key);
if (value == null) {
    Object newValue = buildValue(key);
    // putIfAbsent() returns the existing value if another thread won the
    // race, or null if our newValue was stored.
    value = map.putIfAbsent(key, newValue);
    if (value == null) {
        value = newValue;
    }
}

So, how do the performance and concurrency of HashMap, Hashtable and ConcurrentHashMap stack up?  This test compares the performance of gets and puts in various Map implementations using 1 to 32 threads. It does 100 gets or puts in a Map of size 100.  Two machines were tested.  The first machine is my Windows XP desktop, which has two cores. The second machine is an 8 core Linux server.  All tests were run 5 times and averaged; Oracle Sun JDK 1.6.23 was used.
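A minimal sketch of such a multi-threaded harness (an illustration, not the exact test code; the class name, thread count and duration below are arbitrary) could look like this:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class MapBenchmark {
    static final int KEYS = 100;

    // Run 'threads' workers calling get() on the shared map for 'millis' ms,
    // and return the total number of operations completed by all workers.
    public static long benchmarkGets(final Map<Integer, Integer> map,
            int threads, final long millis) throws InterruptedException {
        for (int key = 0; key < KEYS; key++) {
            map.put(key, key);
        }
        final AtomicLong operations = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread() {
                public void run() {
                    long end = System.currentTimeMillis() + millis;
                    long count = 0;
                    while (System.currentTimeMillis() < end) {
                        for (int key = 0; key < KEYS; key++) {
                            map.get(key);
                        }
                        count += KEYS;
                    }
                    operations.addAndGet(count);
                }
            };
        }
        for (Thread worker : workers) worker.start();
        for (Thread worker : workers) worker.join();
        return operations.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("HashMap gets: "
            + benchmarkGets(new HashMap<Integer, Integer>(), 4, 1000));
        System.out.println("ConcurrentHashMap gets: "
            + benchmarkGets(new ConcurrentHashMap<Integer, Integer>(), 4, 1000));
    }
}
```

Hashtable can be timed the same way by passing one in, since it also implements Map.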

Threads is the number of threads running the test.  The average is the total number of operations performed in the time period by all threads in total.  The %STD is the percentage standard deviation in the results.  The %DIF is the percentage difference between the run and the single threaded run (for the same Map type).

Concurrent Map Performance Comparison (desktop, 2cpu)

Map | Operation | Threads | Average | %STD | %DIF (with 1 thread)
HashMap | get | 1 | 3,551,306 | 0.06% | 0%
HashMap | get | 2 | 4,121,102 | 0.03% | 16%
HashMap | get | 4 | 4,132,506 | 0.15% | 16%
HashMap | get | 8 | 4,227,485 | 0.68% | 19%
HashMap | get | 16 | 4,402,532 | 1.36% | 23%
HashMap | get | 32 | 4,426,514 | 1.61% | 24%
Hashtable | get | 1 | 1,132,956 | 0.06% | 0%
Hashtable | get | 2 | 364,236 | 0.08% | -211%
Hashtable | get | 4 | 274,603 | 0.14% | -312%
Hashtable | get | 8 | 277,188 | 1.08% | -308%
Hashtable | get | 16 | 277,881 | 0.78% | -307%
Hashtable | get | 32 | 296,779 | 2.51% | -281%
ConcurrentHashMap | get | 1 | 2,771,098 | 0.04% | 0%
ConcurrentHashMap | get | 2 | 3,466,451 | 2.30% | 25%
ConcurrentHashMap | get | 4 | 3,458,492 | 0.33% | 24%
ConcurrentHashMap | get | 8 | 3,510,282 | 0.31% | 26%
ConcurrentHashMap | get | 16 | 3,613,182 | 2.36% | 30%
ConcurrentHashMap | get | 32 | 3,599,489 | 2.23% | 29%
HashMap | put | 1 | 3,897,925 | 0.07% | 0%
HashMap | put | 2 | 2,614,784 | 0.01% | -49%
HashMap | put | 4 | 2,473,011 | 0.21% | -57%
HashMap | put | 8 | 2,482,743 | 0.30% | -57%
HashMap | put | 16 | 2,506,519 | 0.53% | -55%
HashMap | put | 32 | 2,579,715 | 0.30% | -51%
Hashtable | put | 1 | 1,042,076 | 0.33% | 0%
Hashtable | put | 2 | 474,199 | 3.06% | -119%
Hashtable | put | 4 | 179,550 | 6.71% | -480%
Hashtable | put | 8 | 183,102 | 1.63% | -469%
Hashtable | put | 16 | 393,085 | 0.68% | -165%
Hashtable | put | 32 | 398,277 | 1.10% | -161%
ConcurrentHashMap | put | 1 | 1,336,292 | 0.21% | 0%
ConcurrentHashMap | put | 2 | 557,880 | 3.71% | -139%
ConcurrentHashMap | put | 4 | 390,736 | 1.64% | -241%
ConcurrentHashMap | put | 8 | 362,653 | 1.21% | -268%
ConcurrentHashMap | put | 16 | 1,492,123 | 0.20% | 11%
ConcurrentHashMap | put | 32 | 1,564,926 | 0.10% | 17%

Concurrent Map Performance Comparison (server, 8cpu)

Map | Operation | Threads | Average | %STD | %DIF (with 1 thread)
HashMap | get | 1 | 3,047,533 | 0.0% | 0%
HashMap | get | 2 | 7,500,603 | 0.1% | 146%
HashMap | get | 4 | 14,080,828 | 0.01% | 362%
HashMap | get | 8 | 25,160,569 | 0.01% | 769%
HashMap | get | 16 | 17,215,757 | 1.2% | 464%
HashMap | get | 32 | 11,797,330 | 7.7% | 287%
Hashtable | get | 1 | 1,165,834 | 0.06% | 0%
Hashtable | get | 2 | 434,485 | 16.9% | -168%
Hashtable | get | 4 | 203,231 | 2.7% | -473%
Hashtable | get | 8 | 201,290 | 2.1% | -479%
Hashtable | get | 16 | 358,459 | 2.3% | -225%
Hashtable | get | 32 | 303,975 | 4.7% | -283%
ConcurrentHashMap | get | 1 | 2,119,602 | 0.0% | 0%
ConcurrentHashMap | get | 2 | 5,044,317 | 0.1% | 137%
ConcurrentHashMap | get | 4 | 9,422,460 | 0.09% | 344%
ConcurrentHashMap | get | 8 | 10,195,480 | 0.0% | 381%
ConcurrentHashMap | get | 16 | 9,799,273 | 1.2% | 362%
ConcurrentHashMap | get | 32 | 9,557,975 | 0.1% | 350%
HashMap | put | 1 | 1,729,801 | 0.02% | 0%
HashMap | put | 2 | 1,347,323 | 0.1% | -28%
HashMap | put | 4 | 1,267,770 | 0.02% | -36%
HashMap | put | 8 | 1,056,226 | 0.0% | -63%
HashMap | put | 16 | 1,055,462 | 0.01% | -63%
HashMap | put | 32 | 1,055,139 | 0.01% | -63%
Hashtable | put | 1 | 1,391,458 | 0.08% | 0%
Hashtable | put | 2 | 211,793 | 13.1% | -556%
Hashtable | put | 4 | 191,052 | 2.8% | -628%
Hashtable | put | 8 | 200,480 | 3.4% | -594%
Hashtable | put | 16 | 399,748 | 2.3% | -248%
Hashtable | put | 32 | 400,840 | 3.2% | -247%
ConcurrentHashMap | put | 1 | 1,503,588 | 0.2% | 0%
ConcurrentHashMap | put | 2 | 441,143 | 0.8% | -240%
ConcurrentHashMap | put | 4 | 380,565 | 1.1% | -295%
ConcurrentHashMap | put | 8 | 354,054 | 11.8% | -324%
ConcurrentHashMap | put | 16 | 1,736,618 | 2.8% | 15%
ConcurrentHashMap | put | 32 | 1,699,647 | 5.7% | 13%

Very interesting results. Given the desktop machine has 2 CPUs, I would have expected the 2+ threaded tests to have at most 2x the single threaded test.  For the server results with 8 CPUs, I would expect the results to double until 8 threads, then flatten out.

Given Hashtable is synchronized, HashMap is not, and ConcurrentHashMap is partially synchronized, I would have expected HashMap to be about 2x, Hashtable to be about 1x, and ConcurrentHashMap to be somewhere in between. The thing I like best about running performance tests is that you rarely get what you expect, and these results are not what I would have expected.

My basic premise holds, in that for get() HashMap had the best concurrency, then ConcurrentHashMap, and then Hashtable.  I would not have expected Hashtable to do so badly though.  Given it is synchronized, only one thread can perform a get() at a time, so naively one would expect the same results as a single thread.  The reality is that it had 5x worse performance than a single thread.  The reasons for this include that synchronization has a certain overhead, which modern JVMs optimize out when only a single thread is accessing the object, but with multiple threads this overhead becomes very apparent.  Also, contention in general has a huge performance cost, as the threads are busy waiting for the lock to become available.  In addition, just having multiple threads adds some overhead with context switching and such, so in general the more threads, the worse the performance, unless the threads are doing something useful.

The desktop results for get() are worse than I would have expected with 2 CPUs, but perhaps the second CPU was busy with other things, such as the OS and garbage collection, etc.

The put() results are much more perplexing.  First of all, I know that running concurrent puts on a HashMap does not make much sense, as HashMap does not support concurrent puts.  I ran the test anyway, just to see what would happen, and the result is quite surprising.  I would expect similar results to the get() test.  However, HashMap had much worse results, similar to Hashtable, as if it were having some contention.  How could this be given that HashMap has no synchronized methods?  My only explanation is that the puts modified the same memory locations, so the threads were contending on the memory access (each CPU invalidating the others' cached copies of those locations).  If you have a better explanation, please comment.

I would have also expected the put() concurrency for ConcurrentHashMap to be better.  I think the primary reason for this was that my tests looped over the same set of keys in the same order, so each thread was trying to put the same key at the same time.  Since ConcurrentHashMap works by having multiple pages and only having to lock a page on put() instead of the entire Map, it still had contention because for the most part the threads were accessing the same keys at the same time.  I think this is also why the > 8 thread runs fared better.  With more threads they got more out of sync, and had fewer page conflicts.  This is an important point: ConcurrentHashMap will only perform well when it is large enough to have many pages, and when access to it is random.  If all the threads are accessing a single page, it is really no better than a Hashtable.
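The page idea can be made concrete with a toy lock-striped map: the key's hash picks one of N internal segments, so two puts only contend when they land on the same segment. This is a deliberate simplification for illustration (the real ConcurrentHashMap is far more sophisticated; its get() is mostly non-blocking, for example):

```java
import java.util.HashMap;
import java.util.Map;

// Toy lock-striped map: puts to different stripes can proceed in parallel,
// because each stripe has its own lock instead of one lock for the whole map.
public class StripedMap<K, V> {
    private static final int STRIPES = 16;
    private final Map<K, V>[] segments;

    @SuppressWarnings("unchecked")
    public StripedMap() {
        segments = new Map[STRIPES];
        for (int i = 0; i < STRIPES; i++) {
            segments[i] = new HashMap<K, V>();
        }
    }

    private Map<K, V> segmentFor(Object key) {
        // Mask off the sign bit so the index is never negative.
        return segments[(key.hashCode() & 0x7fffffff) % STRIPES];
    }

    public V put(K key, V value) {
        Map<K, V> segment = segmentFor(key);
        synchronized (segment) { // only this stripe is locked, not the whole map
            return segment.put(key, value);
        }
    }

    public V get(Object key) {
        Map<K, V> segment = segmentFor(key);
        synchronized (segment) {
            return segment.get(key);
        }
    }
}
```

If every thread hammers keys that hash to the same stripe, this degenerates into a single lock, which is exactly the behavior described above.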

So, what does all this mean?  In general, if you have static meta-data that is read-only and requires concurrent access, then use HashMap (it is only unsafe under concurrent puts; concurrent gets are fine).  If you require concurrent read-write access, then use ConcurrentHashMap.  If you just like being old school, then use Hashtable, but beware the hidden costs of concurrency.

Wednesday, January 26, 2011

JVM Performance, Part II - JVM Bazaar

As promised, I have results from running some of the tests in other JVMs. The most interesting results were from the Map test and the Method execution test, so I have re-run those on various JVMs and environments.

The first result comes from running with Oracle (Sun) JDK 1.6.23 on the same Windows XP machine. Next I ran on Oracle (Sun) JDK 1.5.5, and then Oracle JRockit 1.6. I then ran the tests on another machine, an old 4 cpu 8 core AMD machine running Linux. The results are very surprising, each different machine and environment seemed to have very different behavior.

Note: These tests are not in any way attempting to compare Windows vs Linux performance, or Oracle Sun JDK vs Oracle JRockit; completely different hardware, JVMs and environments are used.  The goal is to see how different operations compare on the same JVM, and how one JVM and environment can differ from another.  Some of the Windows results are most likely faster because the machine is newer and has a faster CPU; some of the Linux results are possibly faster because the machine has 8 CPUs instead of 2, so can process garbage collection on other threads.

The test uses a Map of size 100 testing instantiation of the Map, 100 puts and 100 gets. The average result is included. The %DIF is between the Map and the Hashtable results for that same JVM.

Map Operation Performance Comparison

Map | WXP-JDK1.6.7 | %DIF | WXP-JDK1.6.23 | %DIF | WXP-JDK1.5.5 | %DIF
Hashtable | 357,229 | 0% | 379,673 | 0% | 219,961 | 0%
HashMap | 274,587 | -30% | 403,501 | +6.2% | 238,916 | +8.6%
LinkedHashMap | 269,840 | -32% | 356,207 | -6.5% | 228,787 | +4.0%
IdentityHashMap | 110,801 | -222% | 121,691 | -211% | 60,670 | -262%
ConcurrentHashMap | 119,068 | -200% | 183,159 | -107% | 109,912 | -100%
HashSet | 281,960 | -26% | 407,709 | +7.3% | 228,936 | +4.0%

Map | LX-JDK1.6.20 | %DIF | LX-JDK1.5.5 | %DIF | LX-JRK1.6.5 | %DIF
Hashtable | 343,250 | 0% | 307,227 | 0% | 474,378 | 0%
HashMap | 496,258 | +44% | 375,805 | +22% | 642,267 | +35%
LinkedHashMap | 476,535 | +38% | 417,273 | +35% | 595,331 | +25%
IdentityHashMap | 552,437 | +60% | 478,350 | +55% | 716,368 | +51%
ConcurrentHashMap | 293,529 | -16% | 299,675 | -2.5% | 331,730 | -43%
HashSet | 451,615 | +31% | 381,843 | +24% | 633,824 | +33%

The first good thing to notice is that for the same machine, the results get better with newer JVMs.  That is very nice, and something that most Java developers are familiar with: each new JVM version having better performance than the previous.

The second thing to notice is that the Map types have huge variations based on the JVM.  HashMap was faster than Hashtable in 1.6.23 and even 1.5.5, but somehow slower in 1.6.7.  I found this very perplexing, but no matter how many times, or in which order, I ran the tests the results were the same.  I also compared the code between the releases, and did not find any code changes that could account for the change in performance.  I assume it has to do with how the JVM manages memory and synchronized methods.  One would think that the code for Hashtable, HashMap, and IdentityHashMap would all be the same, since the functionality is essentially identical, but in fact they share none of the same code, and use quite different data structures.  I assume this accounts for the difference in performance under the different JVMs, so none of them is inferior; they are just different.

Note that these tests are still single threaded tests.  So although they compare the raw performance difference between Hashtable and HashMap, the far bigger issue is the effect Hashtable's synchronized methods have on concurrency.  This is also true for ConcurrentHashMap: although its raw throughput seems to be worse, its concurrency is far better, and in a multi-threaded environment it will fare much better.  I will hopefully explore this in another blog post.

Method Execution Performance Comparison

Execution type | WXP-JDK1.6.7 | %DIF | WXP-JDK1.6.23 | %DIF | WXP-JDK1.5.5 | %DIF
Normal | 25,533,130 | 0% | 29,608,720 | 0% | 18,557,941 | 0%
synchronized | 13,383,707 | -93% | 14,373,780 | -105% | 1,254,183 | -1379%
Block synchronized | 8,244,087 | -203% | 8,812,159 | -235% | 1,253,624 | -1380%
final | 26,873,873 | +6.1% | 30,812,911 | +4.0% | 18,853,714 | +1.5%
In-lined | 26,816,109 | +5.2% | 30,815,433 | +4.0% | 19,734,119 | +6.3%
volatile | 1,503,727 | -1539% | 3,290,550 | -799% | 1,448,561 | -1181%
Reflection | 159,069 | -1564% | 145,585 | -20237% | 82,713 | -22336%

Execution type | LX-JDK1.5.5 | %DIF | LX-JDK1.6.20 | %DIF | LX-JRK1.6.5 | %DIF
Normal | 4,666,753 | 0% | 4,840,361 | 0% | 4,257,995 | 0%
synchronized | 2,250,871 | -107% | 4,413,458 | -9.6% | 3,465,398 | -22%
Block synchronized | 2,252,474 | -107% | 4,410,721 | -9.7% | 3,672,440 | -15%
final | 4,666,384 | 0.0% | 4,911,130 | +1.4% | 4,392,700 | +3.1%
In-lined | 4,668,538 | 0.0% | 4,874,539 | +0.7% | 4,042,412 | -5.3%
volatile | 3,236,491 | -44% | 3,231,105 | -49% | 3,032,121 | -40%
Reflection | 2,645,911 | -76% | 2,568,412 | -88% | 132,232 | -3120%

The first good thing to note again is that newer JVMs have, for the most part, better performance.  This is one of the great aspects of developing on the Java platform.

The second thing to notice is once again that the behavior of the various JVMs is very different.  They all seem to be consistent in that synchronized methods, volatile and reflection are slower, but the degree of the difference is pretty major.  Synchronized methods seem to be much better in JDK 1.6 vs 1.5, and in the latest Linux JVM have almost no overhead at all.  Reflection also improved a lot from 1.5 to 1.6, but again the Linux overhead is less than a tenth of the Windows JVM results (except for JRockit).  Volatile has similar differences.  This could have more to do with the hardware than the OS or JVM, as the Linux machine has 8 CPUs, so may be doing some memory management using its other CPUs (but this does not explain the JRockit result...).
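A naive sketch of how this kind of method invocation comparison can be timed follows. This is an illustration only, not the original test code: there is no JIT warm-up control, so the absolute numbers are rough, and the class and method names are made up for the example:

```java
import java.lang.reflect.Method;

public class CallBenchmark {
    private int count;

    public void increment() { count++; }
    public synchronized void incrementSynchronized() { count++; }

    // Each timing method invokes the target 'iterations' times
    // and returns the elapsed time in nanoseconds.
    public static long timeNormal(CallBenchmark target, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) target.increment();
        return System.nanoTime() - start;
    }

    public static long timeSynchronized(CallBenchmark target, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) target.incrementSynchronized();
        return System.nanoTime() - start;
    }

    public static long timeReflection(CallBenchmark target, int iterations)
            throws Exception {
        Method method = CallBenchmark.class.getMethod("increment");
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) method.invoke(target);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        CallBenchmark target = new CallBenchmark();
        int iterations = 10000000;
        System.out.println("normal: " + timeNormal(target, iterations) + " ns");
        System.out.println("synchronized: " + timeSynchronized(target, iterations) + " ns");
        System.out.println("reflection: " + timeReflection(target, iterations) + " ns");
    }
}
```

Run with a large iteration count and several repetitions, and discard the first runs, to let the JIT settle before trusting the numbers.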

Summary

So what have we learned?  I think the most important thing is that different JVMs and environments can have very different behavior, so it is important to test in the environment you will run in production.  If you do not have that luxury, then it is important to test in a variety of different environments to ensure you are not making trade-offs in one environment that could hurt you in another.

I would not get too obsessed with worrying about how your application's performance may differ in different environments.  By and large, the main performance optimizations of reducing CPU usage, reducing message sends, reducing memory and garbage, and improving concurrency will improve your application's performance no matter what the environment is.