Tuesday, November 13, 2012

EclipseLink 15x faster than other JPA providers

I'm sorry to disappoint those of you looking to see some canned benchmark showing how EclipseLink dominates over all competitors in performance and scalability.

It is not that this would be difficult to do, or that I don't have such a benchmark that shows this, but that I give the JPA community credit for not attributing much weight to a benchmark produced by a single JPA provider that shows their JPA product excels against their competitors. If you really want such a benchmark, scroll to the bottom of this post.

There are many such provider produced benchmarks out there. There are websites, blog posts, and public forum posts from these vendors marketing their claims. If you believe these claims, and wish to migrate your application to some previously unheard of JPA provider, then I have an email that I would like to send to you from a Nigerian friend who needs help transferring money into your country.

The only respectable benchmark that I know of that heavily utilizes JPA is the SPECjEnterprise2010 benchmark from the Standard Performance Evaluation Corporation, a respected, independent standards body that produces many industry standard benchmarks. SpecJ is more than a JPA benchmark, as it measures the performance of the entire JavaEE stack, but JPA is a major component in it.

One could argue that SpecJ is too broad of a benchmark, and defining the performance of an entire JavaEE stack with a single number is not very meaningful. I have helped in producing Oracle's SpecJ results for several years, both with TopLink/EclipseLink JPA on WebLogic and previously with TopLink CMP on OC4J. I can honestly tell you that producing good numbers on the benchmark is not easy, and does require you to significantly optimize your code, and in particular optimize your concurrency and scalability. Having good JPA performance will not guarantee you good results on SpecJ, as there are other components involved, but having poor performance, concurrency or scalability in any component in your stack will prevent you from having good results. SpecJ emulates a real multi-user application, with a separate driver machine (or many) that simulate a large number of concurrent users hitting the system. It is very good at finding any performance, concurrency, or scalability bottlenecks in your stack.

One could also argue that it is very difficult for small JPA providers to publish a SpecJ result. Publishing a result requires having a full JavaEE stack and being integrated with it, and having access to real hardware, and investing the time and having the expertise to analyze and optimize the implementation on the real hardware. Of course, one could also argue if you don't have access to real hardware and real expertise, can you really expect to produce a performant and scalable product?

Ideally, if you are looking for the most performant JPA provider for your application, you would benchmark the providers on your own application, and on your production hardware. Writing a good performance benchmark and tests can be a difficult task. It can be difficult to get consistent results, and easy to misinterpret things. In general, if a result doesn't make sense, there is probably something fishy going on. I have seen many such invalid performance comparisons from our users over the years (and from competitors as well). One that I remember was a user who had a test showing TopLink was 100x slower than their JDBC code for a simple query. After looking at the test, I found their JDBC code executed a statement, but did not actually fetch any rows, or even close the statement. Once their JDBC code was fixed to be valid, TopLink was actually faster than their non-optimized JDBC code.
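For reference, a valid JDBC comparison has to actually fetch the rows and release the statement; a rough sketch (the table, columns, connection and id variables here are made up for illustration):

try (PreparedStatement statement = connection.prepareStatement(
    "SELECT EMP_ID, F_NAME, L_NAME FROM EMP WHERE EMP_ID = ?")) {
  statement.setLong(1, id);
  try (ResultSet result = statement.executeQuery()) {
    // The rows must actually be read, otherwise little real work is measured.
    while (result.next()) {
      long empId = result.getLong(1);
      String firstName = result.getString(2);
      String lastName = result.getString(3);
      // ... build the object from the row ...
    }
  }
  // try-with-resources closes the ResultSet and the PreparedStatement.
}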

In EclipseLink and Oracle TopLink we take performance and scalability very seriously. In EclipseLink we run a large suite of performance tests every week, and every release to ensure our performance is better than our previous release, and better than any of our main competitors. Although our tests show we have the best performance today, this has not always been the case. We did not write the tests to showcase our performance, but to find problem areas where we were slower than our competitors. We then optimized those areas to ensure we were then faster than our competitors. I'm not claiming EclipseLink, or Oracle TopLink are faster than every other JPA implementation, in every use case. There will be some use cases where we are not faster, and some where our default configuration differs from another JPA provider that makes us slower until the configuration is fixed.

For example, EclipseLink does not automatically join fetch EAGER relationships. We view automatic join fetching as wrong, since EAGER does not mean JOIN FETCH; we provide a separate @JoinFetch annotation to specify this. So there are some benchmarks, such as one from an object database that supports JPA, that show EclipseLink slower for ElementCollection mappings. This is solely because of the configuration, and we would be faster with the simple addition of a @JoinFetch, as shown below.
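As a minimal sketch, the join fetch can be configured directly on the mapping with EclipseLink's @JoinFetch annotation (the entity and attribute names here are made up for illustration):

@Entity
public class Employee {
  @Id
  long id;
  // EclipseLink extension: fetch the collection in the same SQL statement,
  // using an outer join so rows without collection entries are still returned.
  @ElementCollection
  @JoinFetch(JoinFetchType.OUTER)
  List<String> responsibilities;
}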

EclipseLink also does not enable batch writing by default. You can configure this easily in the persistence.xml, but some other JPA providers use batch writing by default on some databases, so this can show a difference. EclipseLink also requires the use of an agent to enable LAZY and other optimizations; some "benchmarks" fail to use this agent, which seriously affects EclipseLink's performance.
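As a rough sketch, batch writing can be enabled through persistence unit properties (the same keys can go in the persistence.xml); the unit name and batch size below are placeholders, and the agent is the standard -javaagent:eclipselink.jar option:

// Run with the weaving agent to enable LAZY and other optimizations:
// java -javaagent:eclipselink.jar ...
Map<String, String> properties = new HashMap<>();
properties.put("eclipselink.jdbc.batch-writing", "JDBC");
properties.put("eclipselink.jdbc.batch-writing.size", "100");
EntityManagerFactory factory =
    Persistence.createEntityManagerFactory("my-unit", properties);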

The raw throughput of JPA is rarely the main factor in an application's performance. How optimized the database access is, what features the JPA provider supports, and how well the application makes use of JPA have a big impact on an application's performance. EclipseLink and Oracle TopLink have a large set of performance and scalability features. I'm not going to go over all of these today, but you can browse my blog for other posts on some of them.

Finally, if you really want a benchmark showing EclipseLink is 15x faster than any other JPA implementation (even the ones that make the same claim), you can find it here. When running this benchmark on my desktop, accessing a local Derby database, it does 1,016,884 queries per second. This is over 15x (88x in fact) faster than other JPA providers, and even 145x faster than the self-proclaimed "fastest JPA implementation on the face of the Earth".

Granted, EclipseLink contains an integrated, feature-rich object cache, whereas other JPA providers do not. So I also ran the "benchmark" with caching disabled. Even then EclipseLink was 2.7x faster. Of course, EclipseLink also supports statement caching and other JPA providers do not, so I disabled this as well. Again EclipseLink was 2.2x faster than the self-proclaimed "fastest JPA implementation on the face of the Earth".

Was it a fair, objective benchmark that emulates a real application environment? Probably not... but don't take my, or anyone else's word for it.

Tuesday, May 15, 2012

JPQL vs SQL, why not have both

One of the most common questions I see on JPA is users wanting to know how to write some specific SQL query as a JPQL query. Some of the time, they just need to learn JPQL, and their SQL can easily be converted to JPQL. Other times their SQL is using one of the many features of SQL that are not provided in JPQL, and their only option is to use a native SQL query.

In EclipseLink 2.4, we have greatly enhanced our JPQL support to support most features of SQL. I refer to EclipseLink's JPQL extensions as the EclipseLink Query Language, or EQL.

The JPQL support in EclipseLink 1.0 followed the JPA 1.0 BNF. It was not until the 2.1 release that we started adding some extensions, and removing restrictions. We introduced FUNC and TREAT in 2.1. FUNC allows calling any database function, and TREAT allows casting to a subclass for entities with inheritance.

In the upcoming 2.4 release (available for download today here), several new features have been added:

  • ON
  • UNION
  • INTERSECT
  • EXCEPT
  • NULLS FIRST/LAST
  • CAST
  • EXTRACT
  • REGEXP
  • FUNCTION
  • OPERATOR
  • SQL
  • COLUMN
  • TABLE

EQL not only allows usage of more of the SQL syntax and functionality, but also allows the mixing of SQL, and SQL constructs within JPQL. EQL provides a hybrid query language that is object-oriented and database platform independent, but can still access raw data and database specific functionality when required.

ON

SQL defines an ON clause to joins, but JPA 2.0 JPQL does not. JPQL does not require an ON clause because when a relationship is joined, the ON clause comes from the join columns already defined in the mapping. Sometimes however it is desirable to append additional conditions to the join condition, normally in the case of outer joins.

EclipseLink supports the ON clause, both for relationship joins, and to define joins between two independent objects.

The JPA 2.1 draft defines an ON clause, but only on relationship joins, not on joins between independent objects.

SELECT e FROM Employee e LEFT JOIN e.address a ON a.city = :city

SELECT e FROM Employee e LEFT JOIN MailingAddress a ON e.address = a.address

UNION, INTERSECT, EXCEPT

SQL supports UNION, INTERSECT and EXCEPT, but JPA 2.0 JPQL does not. Most unions can be done in terms of joins, but some cannot, and some are more difficult to express using joins.

EclipseLink supports UNION, INTERSECT and EXCEPT, including the ALL option to include duplicates.

SELECT MAX(e.salary) from Employee e where e.address.city = :city1
UNION SELECT MAX(e.salary) from Employee e where e.address.city = :city2

SELECT e from Employee e join e.phones p where p.areaCode = :areaCode1
INTERSECT SELECT e from Employee e join e.phones p where p.areaCode = :areaCode2

SELECT e from Employee e
EXCEPT SELECT e from Employee e WHERE e.salary > e.manager.salary

NULLS FIRST

SQL supports NULLS FIRST, and NULLS LAST ordering options but JPA 2.0 JPQL does not.

EclipseLink supports NULLS FIRST, and NULLS LAST in the ORDER BY clause.

SELECT e FROM Employee e LEFT JOIN e.manager m ORDER BY m.lastName NULLS FIRST

CAST

SQL supports a CAST function to convert between datatypes, JPA 2.0 JPQL does not support this function.

EclipseLink supports CAST allowing any value to be cast to any database type supported by the database.

SELECT CAST(e.salary NUMERIC(10,2)) FROM Employee e

EXTRACT

SQL supports an EXTRACT function for accessing date/time values, JPA 2.0 JPQL does not support any functions for accessing date/time values.

EclipseLink supports EXTRACT allowing any database supported date/time part value to be extracted from the date/time.

SELECT EXTRACT(YEAR, e.startDate) FROM Employee e

REGEXP

Regular expression comparisons are supported by many databases, although there is no standard SQL syntax that has been adopted by major databases yet. JPA 2.0 JPQL does not support regular expressions.

EclipseLink supports REGEXP on Oracle, PostgreSQL, MySQL, MongoDB, and other supporting databases.

SELECT e FROM Employee e WHERE e.lastName REGEXP '^Dr\.*'

FUNC and FUNCTION

SQL supports a lot more database functions than JPQL. Specific database vendors also provide their own set of functions. Users and libraries can also define their own database functions. JPA 2.0 JPQL provides no mechanism to call database specific functions. The JPA 2.1 draft defines a special FUNCTION operator in JPQL to allow calling a specific database function.

EclipseLink 2.1 provided this support using the FUNC operator. EclipseLink 2.4 will also provide the same functionality using the JPA 2.1 FUNCTION operator.

SELECT FUNC('YEAR', e.startDate) AS year, COUNT(e) FROM Employee e GROUP BY year

OPERATOR

Some SQL functions have special syntax such as EXTRACT that takes a date part keyword (YEAR, MONTH, DAY), or CAST that takes a type name (NUMBER(10,2)). The FUNCTION operator cannot be used to call these special functions because of their special syntax. FUNCTION is also database specific, and does not allow the JPA provider to map the function to a different name on a different database platform.

EclipseLink has support for over 80 database functions through defined EclipseLink ExpressionOperators. EclipseLink has always supported these operators using EclipseLink Expression queries, and now supports them using the JPQL OPERATOR keyword. OPERATOR allows calling any EclipseLink ExpressionOperator. EclipseLink ExpressionOperators are database independent, in that if a database provides an equivalent function, it is used. ExpressionOperators support generating any syntax, to allow calling database functions that require special syntax. The list of supported EclipseLink ExpressionOperators is here. Users can also define their own ExpressionOperators.

SELECT e FROM Employee e WHERE OPERATOR('ExtractXml', e.resume, '@years-experience') > 10

SQL

The SQL operator allows for any SQL to be embedded in the JPQL query. This allows a hybridization of JPQL and SQL, giving the advantages of both in the same query. Previously if any part of the query required something not supported by JPQL, the entire query would need to be rewritten as a native SQL query. Now JPQL can still be used, and the SQL operator can be used just for the parts that require SQL. The SQL operator accepts a variable number of arguments, which are translated into the SQL string using the ? marker.
SELECT p FROM Phone p WHERE SQL('CAST(? AS CHAR(3))', p.areaCode) = '613'

SELECT SQL('EXTRACT(YEAR FROM ?)', e.startDate) AS year, COUNT(e) FROM Employee e GROUP BY year

SELECT e FROM Employee e ORDER BY SQL('? NULLS FIRST', e.startDate)

SELECT e FROM Employee e WHERE e.startDate = SQL('(SELECT SYSDATE FROM DUAL)')

COLUMN

The COLUMN operator allows for any unmapped column to be referenced in JPQL. This can be used to access unmapped columns such as foreign key columns, inheritance discriminators, or system columns such as ROWID.
SELECT e FROM Employee e WHERE COLUMN('MANAGER_ID', e) = :id

SELECT e FROM Employee e WHERE COLUMN('ROWID', e) = :id

TABLE

The TABLE operator allows for any unmapped table to be referenced in JPQL. This can be used to access join, collection, history, auditing, or system tables for use in JPQL queries.
SELECT e, a.LAST_UPDATE_USER FROM Employee e, TABLE('AUDIT') a WHERE a.TABLE = 'EMPLOYEE' AND a.ROWID = COLUMN('ROWID', e)

JOIN FETCH

JPQL does not allow using an alias on a JOIN FETCH, and does not allow nested join fetches. EclipseLink allows these.
SELECT e FROM Employee e JOIN FETCH e.address a ORDER BY a.city

SELECT e FROM Employee e JOIN FETCH e.manager m JOIN FETCH m.manager

Criteria API

You may be asking yourself, this is all great, but what about the Criteria API, can these extensions be used with that?

EclipseLink 2.4 will provide a JpaCriteriaBuilder interface that allows access to EclipseLink specific functionality. Both Criteria queries and JPQL queries get translated to EclipseLink Expression queries, before being translated to SQL. EclipseLink Expressions support all of the above functionality through their API. JpaCriteriaBuilder defines two APIs, toExpression() and fromExpression(), to create Criteria Expression objects from EclipseLink Expression objects.

JpaCriteriaBuilder cb = (JpaCriteriaBuilder)em.getCriteriaBuilder();
CriteriaQuery query = cb.createQuery(Employee.class);
Root emp = query.from(Employee.class);
query.where(cb.fromExpression(cb.toExpression(emp).get("firstName").regexp("^Dr\\.*")));

Summary

EQL offers a lot of functionality to make querying easier and more powerful. There are still a few features of SQL that would be difficult to express with EQL, but the majority of SQL is feasible. Of course, there is nothing in JPA that forces you to use JPQL, and if you love SQL, you are still free to use native SQL queries.

Monday, April 23, 2012

Objects vs Data, and Filtering a JOIN FETCH

JPA deals with objects, and SQL deals with rows. Some developers love JPA because it allows them to use objects in an object-oriented fashion. However, other developers have trouble understanding JPA precisely because it is object-oriented.

Even though object-oriented programming has been popular for several decades, there are still a lot of procedural languages out there. Developers used to procedural programming have a different mindset than object-oriented developers. This is also true of developers with a lot of experience with SQL. JPA and JPQL can be difficult to understand for developers used to SQL and JDBC.

This can lead to some odd usages and misunderstandings of JPA. In this blog post, I would like to highlight a couple of these misunderstandings, and provide a solution to my favorite, which I call Filtering a JOIN FETCH.

Dobject Model

One JPA usage that I find aggravating is what I call the Dobject model. This is a data model that has been made into an object model. Sometimes this comes across as a class that has the same name as the database table, with all of the same field names, and no relationships. Sometimes there are relationships, but they have the same name as the foreign key columns.

@Entity
public class EMP {
  @Id
  long EMP_ID;
  String F_NAME;
  String L_NAME;
  long MGR_ID;
  long ADDR_ID;
}

The above is an unusual class, and not very object-oriented. It is probably not as useful as it could be if it had relationships instead of foreign keys.

@Entity
public class employees {
  @Id
  long empId;
  String fName;
  String lName;
  @ManyToOne
  Employee mgrId;
  @OneToOne
  Address addrId;
}

This one is very confused. First of all, it seems to be named after its table; the name employees might make sense for a table, but an object is a single entity, so it should not be pluralized. Also, classes in Java should start with an upper-case letter, as classes are proper names of the real-world entities that they represent.

This class at least has relationships, which I suppose is an improvement, but they are named like they are foreign keys, not objects. This normally leads the user to try to query them as values instead of as objects.

Select e from employees e where e.mgrId = 4

Which does not work, because mgrId is an Employee object, not an Id value. The relationship should be named after what it represents, i.e. manager not the foreign key.

A more object-oriented way to define the class would be:

@Entity
public class Employee {
  @Id
  long id;
  String firstName;
  String lastName;
  @ManyToOne(fetch=FetchType.LAZY)
  Employee manager;
  @OneToMany(mappedBy="manager")
  List<Employee> managedEmployees;
  @OneToOne
  Address address;
}

How not to write a DAO

In JPA you do not normally execute queries to insert, update or delete objects. To update an object you just find or query it, change it through its set methods, and commit the transaction. JPA automatically keeps track of what changed and updates what is required to be updated, in the order that it is required to be updated.

To insert an object you call persist() on the EntityManager. To delete an object you call remove() on the EntityManager. This is different from SQL or JDBC, which require you to execute a query to perform any modification.
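As a minimal sketch of these operations inside a transaction (the entity manager, the id variables, and the setLastName() method are assumed here for illustration):

em.getTransaction().begin();

// Insert: persist a new object.
Employee newHire = new Employee();
em.persist(newHire);

// Update: find the managed object and change it; no update query is needed,
// the change is tracked and written out on commit.
Employee existing = em.find(Employee.class, existingId);
existing.setLastName("Smith");

// Delete: remove the managed object.
Employee departed = em.find(Employee.class, departedId);
em.remove(departed);

em.getTransaction().commit();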

JPA does allow UPDATE and DELETE queries through JPQL. These are for batch updates and deletes, not for the deletion or updating of single objects. This can lead to the very confused Data Access Object below:

public class EmployeeDAO {
  public void insert(Employee employee) {
    em.persist(employee);
  }
  public void update(Employee employee) {
    Query query = em.createQuery("Update Employee e set e.firstName = :firstName, e.lastName = :lastName where e.id = :id");
    query.setParameter("id", employee.getId());
    query.setParameter("firstName", employee.getFirstName());
    query.setParameter("lastName", employee.getLastName());
    query.executeUpdate();
  }
  public void delete(Employee employee) {
    Query query = em.createQuery("Delete from Employee e where e.id = :id");
    query.setParameter("id", employee.getId());
    query.executeUpdate();
  }
}

This is wrong. You do not execute queries to update or delete objects. This will leave the objects in your persistence context in an invalid state, as they are not aware of the query updates. It is also not using JPA correctly, or as it was intended, and not benefiting from its full functionality. A better Data Access Object would be:
public class EmployeeDAO {
  public void persist(Employee employee) {
    em.persist(employee);
  }
  public Object merge(Employee employee) {
    return em.merge(employee);
  }
  public void remove(Employee employee) {
    em.remove(employee);
  }
}

Note that there is no update(). JPA does not have or require an update(); merge() can be used for detached objects, but it is not the equivalent of update(). You do not need to call update in JPA; this is one of its benefits.

JPQL vs SQL

JPQL is not SQL. It looks a lot like SQL, has similar syntax and uses the same standard naming for operators and functions, but it is not SQL. This can be very confusing for someone experienced with SQL. When they try to use their SQL in place of JPQL it does not work.

Select * from Employee e join Address a on e.addressId = a.id where a.city = 'Ottawa'

This is SQL, not JPQL.
The equivalent JPQL would be:
Select e from Employee e where e.address.city = 'Ottawa'

Of course, if you prefer SQL, JPA fully allows you to use SQL; you just need to call createNativeQuery instead of createQuery. However, most users prefer to use JPQL. I suppose this is because JPQL lets them deal with objects, and even if they don't quite understand objects, they do understand there is some benefit there.

JPQL also defines the JOIN syntax, but it does not have an ON clause, and JOIN is based on relationships, not foreign keys.

Select e from Employee e join e.address a where a.city = 'Ottawa'

The JOIN syntax in JPQL allows you to query collection relationships, and to use OUTER joins and FETCH. A JOIN FETCH allows you to read an object and its relationship in a single query (as opposed to the possible dreaded n+1 queries).

Select e from Employee e left join fetch e.address where e.address.city = 'Ottawa'

Notice that I did not use an alias on this JOIN FETCH. This is because JPQL does not allow this, which I will get into later. There is also no ON clause in JPQL because the relationship is always joined by the foreign keys defined in the relationship's join columns in its mapping.

Sometimes it is desirable to place additional conditions in a JOIN ON clause. This is normally the case with OUTER joins, where placing the condition in the WHERE clause would result in the empty joins being filtered out. An ON clause is part of the JPA 2.1 draft, so it is coming. EclipseLink already supports the ON clause, as well as aliasing a JOIN FETCH, in its 2.4 development milestones, see:

www.eclipse.org/eclipselink/downloads/milestones

Filtering a JOIN FETCH

A common misunderstanding I see users make occurs when querying a OneToMany relationship.
Select d from Department d join d.employees e where e.address.city = 'Ottawa'

This query results in all department objects that have any employee living in Ottawa. The confusion comes when they access the department's employees. Each department contains all of its employees, however they were expecting just the employees that live in Ottawa.

This is because JPA deals with objects, and a Department object represents a specific department, and has a specific identity. If I issue two queries for a particular department, I get back the same identical (==) instance (provided both queries use the same EntityManager). A specific department always has the same employees; it represents the real-world department, and does not change just because you queried it differently. This is important for caching, but also within the same persistence context: if you query the same department two different ways, you should get back the exact same department.
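A small sketch of this identity guarantee within a single persistence context (the departmentId variable is a placeholder):

Department byId = em.find(Department.class, departmentId);
Department byQuery = em.createQuery(
    "Select d from Department d where d.id = :id", Department.class)
    .setParameter("id", departmentId)
    .getSingleResult();
// Both lookups resolve to the same managed instance.
assert byId == byQuery;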

If you really want the department, and only the employees of the department that live in Ottawa, you can use the following query:
Select d, e from Department d join d.employees e where e.address.city = 'Ottawa'

This will give you a List of Object[] that contain the Department and the Employee. For each department you will get back n Object[] (rows), where n is the number of its employees that live in Ottawa. The same department will be duplicated n times, each time paired with a different employee. If you access the employees of any of the departments, they will contain all of the department's employees, not just the ones living in Ottawa.
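Consuming that result looks something like the following sketch, where each row pairs a department with one of its matching employees:

List<Object[]> results = em.createQuery(
    "Select d, e from Department d join d.employees e where e.address.city = 'Ottawa'",
    Object[].class).getResultList();
for (Object[] row : results) {
    Department department = (Department) row[0];
    Employee employee = (Employee) row[1];
    // The same department instance repeats once per matching employee.
}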

But, if you really, really want the department to only have the employees that live in Ottawa, you can do this. I'm not sure I would recommend it, but it is possible; at least in EclipseLink 2.4 it will be. EclipseLink allows you to use an alias on a JOIN FETCH. This support was intended for OneToOne and ManyToOne relationships, to avoid having to join the relationship twice just to get an alias, as well as to allow using it in an ORDER BY, or in other ways that would not filter the results and alter how the objects are built. But there is nothing stopping you from using it with a OneToMany to filter the contents of the fetched OneToMany results.

If you are going to do this, you should be careful, and at least ensure you set the "javax.persistence.cache.storeMode" and "javax.persistence.cache.retrieveMode" to BYPASS, to avoid corrupting the shared cache.

Query query = em.createQuery("Select d from Department d join fetch d.employees e where e.address.city = 'Ottawa'");
query.setHint("javax.persistence.cache.storeMode", "BYPASS");
query.setHint("javax.persistence.cache.retrieveMode", "BYPASS");
List departmentsWithFilteredEmployees = query.getResultList();

EclipseLink 2.4 is currently under development, but its milestone builds already contain this functionality. The EclipseLink 2.4 milestone builds can be download from:
www.eclipse.org/eclipselink/downloads/milestones

For more information of the many JPQL extensions and enhancements in EclipseLink 2.4 see:
wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Querying/JPQL

Monday, April 2, 2012

EclipseLink JPA supports MongoDB

EclipseLink 2.4 will support JPA access to NoSQL databases. This support is already part of the EclipseLink development trunk and can be tried out using the milestone or nightly builds. Initial support is provided for MongoDB and Oracle NoSQL. A pluggable platform and adapter layer allows other databases to be supported.

NoSQL is a classification of database systems that do not conform to the relational database or SQL standard. They have various roots, from distributed internet databases, to object databases, XML databases and even legacy databases. They have recently become popular because of their use in large-scale distributed databases at Google, Amazon, and Facebook.

There are various NoSQL databases including:
  • MongoDB
  • Oracle NoSQL
  • Cassandra
  • Google BigTable
  • CouchDB

EclipseLink's NoSQL support allows the JPA API and JPA annotations/xml to be used with NoSQL data. EclipseLink also supports several NoSQL-specific annotations/xml, including @NoSql, which defines a class as mapping to NoSQL data.

EclipseLink's NoSQL support is based on previous EIS support offered since EclipseLink 1.0. EclipseLink's EIS support allowed persisting objects to legacy and non-relational databases. EclipseLink's EIS and NoSQL support uses the Java Connector Architecture (JCA) to access the data-source similar to how EclipseLink's relational support uses JDBC. EclipseLink's NoSQL support is extendable to other NoSQL databases, through the creation of an EclipseLink EISPlatform class and a JCA adapter.

Let's walk through an example of using EclipseLink's NoSQL support to persist an ordering system's object model to a MongoDB database.

The source for the example can be found here, or from the EclipseLink SVN repository.

Ordering object model

The ordering system consists of four classes, Order, OrderLine, Address and Customer. The Order has a billing and shipping address, many order lines, and a customer.

public class Order implements Serializable {
  private String id;
  private String description;
  private double totalCost = 0;
  private Address billingAddress;
  private Address shippingAddress;
  private List orderLines = new ArrayList();
  private Customer customer;
  ...
}

public class OrderLine implements Serializable {
  private int lineNumber;
  private String description;
  private double cost = 0;
  ...
}

public class Address implements Serializable {
  private String street;
  private String city;
  private String province;
  private String country;
  private String postalCode;
  ...
}

public class Customer implements Serializable {
  private String id;
  private String name;
  ...
}

Step 1 : Decide how to store the data

There is no standard on how NoSQL databases store their data. Some NoSQL databases only support key/value pairs, others support structured hierarchical data such as JSON or XML.

MongoDB stores data as BSON (binary JSON) documents. The first decision that must be made is how to store the objects. Normally each independent object would compose a single document, so a single document could contain Order, OrderLine and Address. Since customers can be shared amongst multiple orders, Customer would be its own document.

Step 2 : Map the data

The next step is to map the objects. Each root object in the document will be mapped as an @Entity in JPA. The objects that are stored by being embedded within their parent's document are mapped as @Embeddable. This is similar to how JPA maps relational data, but in NoSQL embedded data is much more common because of the hierarchical nature of the data format. In summary, Order and Customer are mapped as @Entity, OrderLine and Address are mapped as @Embeddable.

The @NoSql annotation is used to map NoSQL data. It tags the classes as mapping to NoSQL data instead of traditional relational data. It is required on each persistent class, both entities and embeddables. The @NoSql annotation allows the dataType and the dataFormat to be set.

The dataType is the equivalent of the table in relational data; its meaning can differ depending on the NoSQL data-source being used. With MongoDB the dataType refers to the collection used to store the data. The dataType defaults to the entity name (in upper case), which is the simple class name.

The dataFormat depends on the type of data being stored. Three formats are supported by EclipseLink: XML, Mapped, and Indexed. XML is the default, but since MongoDB uses BSON, which is similar to a Map in structure, Mapped is used. In summary, each class requires the @NoSql(dataFormat=DataFormatType.MAPPED) annotation.

@Entity
@NoSql(dataFormat=DataFormatType.MAPPED)
public class Order

@Embeddable
@NoSql(dataFormat=DataFormatType.MAPPED)
public class OrderLine
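If the defaulted collection name (the upper-case entity name) is not what you want, the dataType attribute can override it; a small sketch assuming a MongoDB collection named "orders":

@Entity
@NoSql(dataType="orders", dataFormat=DataFormatType.MAPPED)
public class Order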

Step 3 : Define the Id

JPA requires that each Entity define an Id. The Id can either be a natural id (an application-assigned id) or a generated id (an id assigned by EclipseLink). MongoDB also requires an _id field in every document. If no _id field is present, then Mongo will auto-generate and assign the _id field using an OID (object identifier), which is similar to a UUID (universally unique identifier).

You are free to use any field or set of fields as your Id in EclipseLink with NoSQL, the same as with a relational Entity. To use an application-assigned id as the Mongo id, simply name its field "_id". This can be done through the @Field annotation, which is similar to the @Column annotation (which will also work), but without all of the relational details; it has just a name. So, to define the field Mongo will use for the id, include @Field(name="_id") in your mapping.
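For an application-assigned id, the mapping is then just the @Id and the "_id" field name:

@Id
@Field(name="_id")
private String id;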

To use the generated Mongo OID as your JPA Id, simply include @Id, @GeneratedValue, and @Field(name="_id") on your object's id field mapping. The @GeneratedValue tells EclipseLink to use the Mongo OID to generate this id value. @SequenceGenerator and @TableGenerator are not supported in MongoDB, so these cannot be used. Also, the generation types IDENTITY, TABLE and SEQUENCE are not supported. You can use the EclipseLink @UUIDGenerator if you wish to use a UUID instead of the Mongo OID. You can also use your own custom generator. The id value for a Mongo OID or a UUID is not a numeric value; it can only be mapped as a String or byte[].

@Id
@GeneratedValue
@Field(name="_id")
private String id;

Step 4 : Define the mappings

Each attribute in your object has to be mapped. If no annotation/xml is defined for an attribute, then its mapping is defaulted. Defaulting rules for NoSQL data follow the JPA defaulting rules, so most simple mappings do not require any configuration. The field names used in the Mongo BSON document will mirror the object attribute names (as upper case). To provide a different BSON field name, the @Field annotation is used.

Any embedded value stored in the document is persisted using the JPA @Embedded annotation. An embedded collection uses the JPA @ElementCollection annotation. The @CollectionTable of the @ElementCollection is not used or supported in NoSQL; as the data is stored within the document, no separate table is required. The @AttributeOverride is also not required nor supported with NoSQL, as the embedded objects are nested in the document and do not require unique field names. The @Embedded annotation/xml is normally not required, as it is defaulted; the @ElementCollection is required, as defaulting does not currently work for @ElementCollection in EclipseLink.

The relationship annotations/xml @OneToOne, @ManyToOne, @OneToMany, and @ManyToMany are only used for external relationships in NoSQL. Relationships within the document use the embedded annotations/xml. External relationships to other documents are supported. To define an external relationship, a foreign key is used: the id of the target object is stored in the source object's document. In the case of a collection, a collection of ids is stored. To define the name of the foreign key field in the BSON document, the @JoinField annotation/xml is used.

The mappedBy option on relationships is not supported for NoSQL data; for bi-directional relationships, the foreign keys would need to be stored on both sides. It is also possible to define a relationship mapping using a query, but this is not currently supported through annotations/xml, only through a DescriptorCustomizer.

@Basic
private String description;
@Basic
private double totalCost = 0;
@Embedded
private Address billingAddress;
@Embedded
private Address shippingAddress;
@ElementCollection
private List orderLines = new ArrayList();
@ManyToOne(fetch=FetchType.LAZY)
private Customer customer;
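Where a BSON field name should differ from the attribute name, or the foreign key field needs an explicit name, @Field and @JoinField can be added to the mappings above; a small sketch (the field names are made up for illustration):

@Basic
@Field(name="total")
private double totalCost = 0;

@ManyToOne(fetch=FetchType.LAZY)
@JoinField(name="customerId")
private Customer customer;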

Step 5 : Optimistic locking

Optimistic locking is supported with MongoDB. It is not required, but if locking is desired, the @Version annotation can be used.

Note that MongoDB does not support transactions, so if a lock error occurs during a transaction, any objects that have been previously written will not be rolled back.

@Version
private long version;
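A rough sketch of handling a lock failure on commit (the orderId variable and setDescription() method are assumed for illustration); the commit typically fails with a RollbackException caused by an OptimisticLockException:

try {
  em.getTransaction().begin();
  Order order = em.find(Order.class, orderId);
  order.setDescription("Updated description");
  em.getTransaction().commit();
} catch (RollbackException exception) {
  // Another client changed the document since it was read; refresh and retry,
  // remembering that MongoDB will not roll back documents already written.
}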

Step 6 : Querying

MongoDB has its own JSON-based query-by-example language. It does not support SQL (i.e. NoSQL), so querying has limitations.

EclipseLink supports both JPQL and the Criteria API on MongoDB. Not all aspects of JPQL are supported: most basic operations are supported, but joins, sub-selects, group bys, and certain database functions are not. Querying through embedded values and element collections is supported, as well as ordering, like, and selecting attribute values.

Not all NoSQL database support querying, so EclipseLink's NoSQL support only supports querying if the NoSQL platform supports it.

Query query = em.createQuery("Select o from Order o where o.totalCost > 1000");
List orders = query.getResultList();

Query query = em.createQuery("Select o from Order o where o.description like 'Pinball%'");
List orders = query.getResultList();

Query query = em.createQuery("Select o from Order o join o.orderLines l where l.description = :desc");
query.setParameter("desc", "shipping");
List orders = query.getResultList();

Native queries are also supported in EclipseLink NoSQL. For MongoDB the native query is in MongoDB's command language.

Query query = em.createNativeQuery("db.ORDER.findOne({\"_id\":\"" + oid + "\"})", Order.class);
Order order = (Order)query.getSingleResult();

Step 7 : Connecting

The connection to a Mongo database is configured through the JPA persistence.xml properties. The "eclipselink.target-database" property must define the Mongo platform "org.eclipse.persistence.nosql.adapters.mongo.MongoPlatform". A connection spec must also be defined by setting "eclipselink.nosql.connection-spec" to "org.eclipse.persistence.nosql.adapters.mongo.MongoConnectionSpec". Other properties can also be set, such as "eclipselink.nosql.property.mongo.db", "eclipselink.nosql.property.mongo.host" and "eclipselink.nosql.property.mongo.port". The host and port can accept a comma-separated list of values to connect to a cluster of Mongo databases.

<persistence-unit name="mongo-example" transaction-type="RESOURCE_LOCAL">
  <class>model.Order</class>
  <class>model.OrderLine</class>
  <class>model.Address</class>
  <class>model.Customer</class>
  <properties>
    <property name="eclipselink.target-database" value="org.eclipse.persistence.nosql.adapters.mongo.MongoPlatform"/>
    <property name="eclipselink.nosql.connection-spec" value="org.eclipse.persistence.nosql.adapters.mongo.MongoConnectionSpec"/>
    <property name="eclipselink.nosql.property.mongo.port" value="27017"/>
    <property name="eclipselink.nosql.property.mongo.host" value="localhost"/>
    <property name="eclipselink.nosql.property.mongo.db" value="mydb"/>
    <property name="eclipselink.logging.level" value="FINEST"/>
  </properties>
</persistence-unit>

Summary

The full source code to this demo is available from SVN.

To run the example you will need a Mongo database, which can be downloaded from, http://www.mongodb.org/downloads.

EclipseLink also supports NoSQL access to other data-sources, including:
  • Oracle NoSQL
  • XML files
  • JMS
  • Oracle AQ

Thursday, March 29, 2012

NoSQL

In the beginning data was free and wild. It was not confined to rows and columns, and not bound by standardization. Data access was unruly and proprietary. These were the first "NoSQL" databases. They consisted of flat-file, hierarchical and network databases such as VSAM, IMS and ADABAS.

Then there was SQL, and things were good.

SQL was developed during the golden age of data in the 1970s. Database access became standardized through the SQL language and the relational model. The 1970s saw the birth of relational database products such as RDBMS, Ingres, Oracle and DB2. The 1980s saw ANSI standardization of the SQL language, and the adoption of client-server computing.

However, the legacy databases still existed, as well as the legacy applications that accessed them. New applications needed to access the old data, and this was in general a very painful experience.

Back in the good old Smalltalk days of the 1990s, Smalltalk was unofficially adopted as the programming language of choice for large corporate projects. It was the beginning of the commercial adoption of object-oriented programming, in Smalltalk, C++ and other OO languages. Things were great, but there was a dark side. All of the data was stored in relational databases, or worse, legacy mainframe databases. Fitting round objects into square relational tables was difficult and cumbersome. Two solutions emerged: object-oriented databases, and object-relational mapping.

New commercial object-oriented database management systems (OODBMS) emerged in the 1990s, including Versant, Gemstone and ObjectStore. They were integrated with their respective languages, Smalltalk and C++, and stored data as it was represented in memory, instead of in the relational model. These were the 2nd generation of "NoSQL" databases. There was little standardization, and solutions were mainly proprietary. Access to the data from non-object-oriented languages was difficult. The world did not adopt this new model, as it had previously adopted the relational model. The world's data remained in the trusted, standardized and universally accessible relational model.

Object-relational mapping allowed objects to be used in the programming model, while having them converted to relational rows and SQL when persisted. A lot of OR mapping frameworks were built, including many corporate in-house solutions. TopLink, a product from The Object People, became the leading OR mapper in Smalltalk. In C++ there was Persistence, as well as various other products in various languages.

Although the relational model was the industry standard for any new applications, much of the world's data remained in mainframe databases. The data was slowly being migrated, but most corporations still had mainframe data. Consulting at TopLink clients in the 90s, I found most clients were building applications on relational databases, but still had to get some data from the mainframe. This is when we created the first version of TopLink's "NoSQL" support. Of course, NoSQL was not a buzzword at the time, so the offering was called TopLink for the Mainframe. The main problem was that everyone's mainframe data and access was different, so the product involved lots of consulting.

When Java came along, TopLink moved from Smalltalk to Java. OR mapping became very popular in Java and many new products came to market. The first real OR standard came in the form of EJB CMP. It had its "issues", to say the least, and was coupled with the J2EE platform. A new competing standard, JDO, was created in retaliation to CMP. To reconcile the issue of having two competing Java standards, JPA was created to replace them both, and was adopted by most OR mapping products.

In response to the popularity of object-oriented computing, the relational database vendors created the object-relational paradigm. This allowed storage of structured object types and collections in relational tables. SQL3 (SQL 1999) defined new query syntax to access this data. Despite some initial hype, the object-relational paradigm was not successful, and although the features remain in Oracle, DB2 and Postgres, the world stayed with the trusted relational model.

The panic around Y2K had the good fortune of getting most corporations and governments off mainframe databases, and onto relational databases. Some legacy data still remained, so we also offered TopLink for the Mainframe in Java. At that time the Internet was taking off, and XML was becoming popular. Since XML is hierarchical data that any mainframe data could be converted to, it became part of our solution for accessing legacy data, and the TopLink SDK was born.

With the explosion of the Internet, XML became increasingly popular. This led, once again, to the questioning of the relational model, and to the creation of XML databases (the 3rd generation of NoSQL). Several XML databases achieved much hype, but limited market success. The relational database vendors responded by adding support for storing XML in relational tables.

Again the world stayed with the relational model.

The TopLink SDK also provided simple object-to-XML mapping, perhaps the first such product to do so. As XML usage in Java became mainstream, the TopLink SDK was split into two products. TopLink MOXy became TopLink's object-to-XML mapping solution. TopLink EIS became TopLink's legacy data persistence solution.

Around 2009 the term NoSQL was used to categorize the new distributed databases being used at Google, Amazon and Facebook. These databases characterized themselves as being highly scalable, not adhering to ACID transaction semantics, and having limited querying. The NoSQL term grew to include the various other non-relational databases that have emerged throughout the ages.

Is the relational model dead? Will the world switch to the NoSQL model, and will data once again be free? Only time will tell. If history teaches us anything, one would expect the relational model to persist. NoSQL has already been renamed in some circles to "Not Only SQL", to leave room for NoSQL databases to support the SQL standard. In fact, some NoSQL databases already have support for JDBC drivers. My intuition is a union of the two models; perhaps this has already begun, with some NoSQL databases adding SQL support, and some relational databases extending their clustering support, such as MySQL Cluster.

EclipseLink 2.4 will contain JPA support for NoSQL databases. Initial support will include MongoDB and Oracle NoSQL. This support is already available in the EclipseLink nightly builds. Technically, this is not new functionality, as EclipseLink (formerly TopLink) has been supporting non-relational data for over a decade, but the JPA support is new.

In the upcoming months I will be blogging about some of the new features in EclipseLink to support NoSQL. This blog post is solely an introduction, so sorry to those expecting hard content.