The object/relational paradigm mismatch can be broken into several parts, which we examine one at a time. Let’s start our exploration with a simple example that is problem free. As we build on it, you’ll see the mismatch begin to appear.
Suppose you have to design and implement an online e-commerce application. In this application, you need a class to represent information about a user of the system, and you need another class to represent information about the user’s billing details, as shown in figure 1.1.
In this diagram, you can see that a User has many BillingDetails. You can navigate the relationship between the classes in both directions; this means you can iterate through collections or call methods to get to the “other” side of the relationship. The classes representing these entities may be extremely simple:
public class User {
String username;
String address;
Set billingDetails;
// Accessor methods (getter/setter), business methods, etc.
}
public class BillingDetails {
String account;
String bankname;
User user;
// Accessor methods (getter/setter), business methods, etc.
}
Note that you’re only interested in the state of the entities with regard to persistence, so we’ve omitted the implementation of property accessors and business methods, such as getUsername() or billAuction( ).
It’s easy to come up with an SQL schema design for this case:
create table USERS (
USERNAME varchar(15) not null primary key,
ADDRESS varchar(255) not null
);
create table BILLINGDETAILS (
ACCOUNT varchar(15) not null primary key,
BANKNAME varchar(255) not null,
USERNAME varchar(15) not null, foreign key (USERNAME) references USERS
);
The foreign key-constrained column USERNAME in BILLINGDETAILS represents the relationship between the two entities. For this simple domain model, the object/relational mismatch is barely in evidence; it’s straightforward to write JDBC code to insert, update, and delete information about users and billing details.
Now let’s see what happens when you consider something a little more realistic. The paradigm mismatch will be visible when you add more entities and entity relationships to your application.
1 - The problem of granularity :
The most glaringly obvious problem with the current implementation is that you’ve designed an address as a simple String value. In most systems, it’s necessary to store street, city, state, country, and ZIP code information separately. Of course, you could add these properties directly to the User class, but because it’s highly likely that other classes in the system will also carry address information, it makes more sense to create an Address class. Figure 1.2 shows the updated model.
Should you also add an ADDRESS table? Not necessarily; it’s common to keep address information in the USERS table, in individual columns. This design is likely to perform better, because a table join isn’t needed if you want to retrieve the user and address in a single query. The nicest solution may be to create a new SQL data type to represent addresses, and to add a single column of that new type in the USERS table instead of several new columns.
You have the choice of adding either several columns or a single column (of a new SQL data type). This is clearly a problem of granularity. Broadly speaking, granularity refers to the relative size of the types you’re working with.
Let’s return to the example. Adding a new data type to the database catalog, to store Address Java instances in a single column, sounds like the best approach:
create table USERS (
USERNAME varchar(15) not null primary key,
ADDRESS address not null
);
A new Address type (class) in Java and a new ADDRESS SQL data type should guarantee interoperability. But you’ll find various problems if you check the support for user-defined data types (UDTs) in today’s SQL database management systems.
UDT support is one of a number of so-called object-relational extensions to traditional SQL. This term alone is confusing, because it means the database management system has (or is supposed to support) a sophisticated data type system—something you take for granted if somebody sells you a system that can handle data in a relational fashion. Unfortunately, UDT support is a somewhat obscure feature of most SQL DBMSs and certainly isn’t portable between different products. Furthermore, the SQL standard supports user-defined data types, but poorly.
This limitation isn’t the fault of the relational data model. You can consider the failure to standardize such an important piece of functionality as fallout from the object-relational database wars between vendors in the mid-1990s. Today, most engineers accept that SQL products have limited type systems—no questions asked. Even with a sophisticated UDT system in your SQL DBMS, you would still likely duplicate the type declarations, writing the new type in Java and again in SQL. Attempts to find a better solution for the Java space, such as SQLJ, unfortunately, have not had much success. DBMS products rarely support deploying and executing Java classes directly on the database, and if support is available, it’s typically limited to very basic functionality and complex in everyday usage.
For these and whatever other reasons, use of UDTs or Java types in an SQL database isn’t common practice in the industry at this time, and it’s unlikely that you’ll encounter a legacy schema that makes extensive use of UDTs. You therefore can’t and won’t store instances of your new Address class in a single new column that has the same data type as the Java layer.
The pragmatic solution for this problem has several columns of built-in vendor-defined SQL types (such as Boolean, numeric, and string data types). You usually define the USERS table as follows:
create table USERS (
USERNAME varchar(15) not null primary key,
ADDRESS_STREET varchar(255) not null,
ADDRESS_ZIPCODE varchar(5) not null,
ADDRESS_CITY varchar(255) not null
);
Classes in the Java domain model come in a range of different levels of granularity: from coarse-grained entity classes like User, to finer-grained classes like Address, down to simple SwissZipCode extending AbstractNumericZipCode (or whatever your desired level of abstraction is). In contrast, just two levels of type granularity are visible in the SQL database: relation types created by you, like USERS and BILLINGDETAILS, and built-in data types such as VARCHAR, BIGINT, or TIMESTAMP.
Many simple persistence mechanisms fail to recognize this mismatch and so end up forcing the less flexible representation of SQL products on the object-oriented model, effectively flattening it.
It turns out that the granularity problem isn’t especially difficult to solve. We probably wouldn’t even discuss it, were it not for the fact that it’s visible in so many existing systems. We describe the solution to this problem in section 4.1.
A much more difficult and interesting problem arises when we consider domain models that rely on inheritance, a feature of object-oriented design you may use to bill the users of your e-commerce application in new and interesting ways.
2 - The problem of subtypes :
In Java, you implement type inheritance using superclasses and subclasses. To illustrate why this can present a mismatch problem, let’s add to your e-commerce application so that you now can accept not only bank account billing, but also credit and debit cards. The most natural way to reflect this change in the model is to use inheritance for the BillingDetails superclass, along with several concrete subclasses: CreditCard, BankAccount, and so on. Each of these subclasses defines slightly different data (and completely different functionality that acts on that data). The UML class diagram in figure 1.3 illustrates this model.
What changes must you make to support this updated Java class structure? Can you create a table CREDITCARD that extends BILLINGDETAILS? SQL database products don’t generally implement table inheritance (or even data type inheritance), and if they do implement it, they don’t follow a standard syntax and might expose us to data integrity problems (limited integrity rules for updatable views).
We aren’t finished with inheritance. As soon as we introduce inheritance into the model, we have the possibility of polymorphism.
The User class has an association to the BillingDetails superclass. This is a polymorphic association. At runtime, a User instance may reference an instance of any of the subclasses of BillingDetails. Similarly, you want to be able to write polymorphic queries that refer to the BillingDetails class, and have the query return instances of its subclasses.
SQL databases also lack an obvious way (or at least a standardized way) to represent a polymorphic association. A foreign key constraint refers to exactly one target table; it isn’t straightforward to define a foreign key that refers to multiple tables. You’d have to write a procedural constraint to enforce this kind of integrity rule.
The result of this mismatch of subtypes is that the inheritance structure in a model must be persisted in an SQL database that doesn’t offer an inheritance mechanism. In chapter 6, we discuss how ORM solutions such as Hibernate solve the problem of persisting a class hierarchy to an SQL database table or tables, and how polymorphic behavior can be implemented. Fortunately, this problem is now well understood in the community, and most solutions support approximately the same functionality.
The next aspect of the object/relational mismatch problem is the issue of object identity. You probably noticed that the example defined USERNAME as the primary key of the USERS table. Was that a good choice? How do you handle identical objects in Java?
3 - The problem of identity :
Although the problem of identity may not be obvious at first, you’ll encounter it often in your growing and expanding e-commerce system, such as when you need to check whether two instances are identical. There are three ways to tackle this problem: two in the Java world and one in your SQL database. As expected, they work together only with some help.
■ Instance identity (roughly equivalent to memory location, checked with a == b)
■ Instance equality, as determined by the implementation of the equals( ) method (also called equality by value)
■ Instance equality, as determined by the implementation of the equals( ) method (also called equality by value)
On the other hand, the identity of a database row is expressed as a comparison of primary key values. As you’ll see in section 10.1.2, neither equals() nor == is always equivalent to a comparison of primary key values. It’s common for several non-identical instances in Java to simultaneously represent the same row of the database—for example, in concurrently running application threads. Furthermore, some subtle difficulties are involved in implementing equals() correctly for a persistent class and understanding when this might be necessary.
Let’s use an example to discuss another problem related to database identity. In the table definition for USERS, USERNAME is the primary key. Unfortunately, this decision makes it difficult to change a user’s name; you need to update not only the row in USERS, but also the foreign key values in (many) rows of BILLINGDETAILS. We also discuss what makes a good primary key. A surrogate key column is a primary key column with no meaning to the application user—in other words, a key that isn’t presented to the application user. Its only purpose is identifying data inside the application.
For example, you may change your table definitions to look like this:
create table USERS (
ID bigint not null primary key,
USERNAME varchar(15) not null unique,
);
create table BILLINGDETAILS (
ID bigint not null primary key,
ACCOUNT varchar(15) not null,
BANKNAME varchar(255) not null,
USER_ID bigint not null,
foreign key (USER_ID) references USERS
);
The ID columns contain system-generated values. These columns were introduced purely for the benefit of the data model, so how (if at all) should they be represented in the Java domain model? We discuss this question in section 4.2, and we find a solution with ORM.
In the context of persistence, identity is closely related to how the system handles caching and transactions. Different persistence solutions have chosen different strategies, and this has been an area of confusion. We cover all these interesting topics— and show how they’re related—in section 10.1.
So far, the skeleton e-commerce application you’ve designed has exposed the paradigm mismatch problems with mapping granularity, subtypes, and identity. You’re almost ready to move on to other parts of the application, but first we need to discuss the important concept of associations : how the relationships between entities are mapped and handled. Is the foreign key constraint in the database all you need?
4 - Problems relating to associations :
In your domain model, associations represent the relationships between entities. The User, Address, and BillingDetails classes are all associated; but unlike Address, BillingDetails stands on its own. BillingDetails instances are stored in their own table. Association mapping and the management of entity associations are central concepts in any object persistence solution.
Object-oriented languages represent associations using object references ; but in the relational world, a foreign key-constrained column represents an association, with copies of key values. The constraint is a rule that guarantees integrity of the association. There are substantial differences between the two mechanisms.
Object references are inherently directional; the association is from one instance to the other. They’re pointers. If an association between instances should be navigable in both directions, you must define the association twice, once in each of the associated classes. You’ve already seen this in the domain model classes:
public class User {
Set billingDetails;
}
public class BillingDetails {
User user;
}
Navigation in a particular direction has no meaning for a relational data model because you can create arbitrary data associations with join and projection operators. The challenge is to map a completely open data model, which is independent of the application that works with the data, to an application-dependent navigational model—a constrained view of the associations needed by this particular application.
Java associations can have many-to-many multiplicity. For example, the classes could look like this:
public class User {
Set billingDetails;
}
public class BillingDetails {
Set users;
}
But the foreign key declaration on the BILLINGDETAILS table is a many-to-one association: each bank account is linked to a particular user. Each user may have multiple linked bank accounts.
If you wish to represent a many-to-many association in an SQL database, you must introduce a new table, usually called a link table. In most cases, this table doesn’t appear anywhere in the domain model. For this example, if you consider the relationship between the user and the billing information to be many-to-many, you define the link table as follows:
create table USER_BILLINGDETAILS (
create table USER_BILLINGDETAILS (
USER_ID bigint,
BILLINGDETAILS_ID bigint,
primary key (USER_ID, BILLINGDETAILS_ID), foreign key (USER_ID) references USERS,
foreign key (BILLINGDETAILS_ID) references BILLINGDETAILS
);
You no longer need the USER_ID foreign key column and constraint on the BILLING-DETAILS table; this additional table now manages the links between the two entities. We discuss association and collection mappings in detail in chapter 7.
So far, the issues we’ve considered are mainly structural : you can see them by considering a purely static view of the system. Perhaps the most difficult problem in object persistence is a dynamic problem: how data is accessed at runtime.
5 - The problem of data navigation :
There is a fundamental difference in how you access data in Java and in a relational database. In Java, when you access a user’s billing information, you call someUser.getBillingDetails().iterator().next() or something similar. This is the most natural way to access object-oriented data, and it’s often described as walking the object network. You navigate from one instance to another, even iterating collections, following prepared pointers between classes. Unfortunately, this isn’t an efficient way to retrieve data from an SQL database.
The single most important thing you can do to improve the performance of data access code is to minimize the number of requests to the database. The most obvious way to do this is to minimize the number of SQL queries. (Of course, other, more sophisticated, ways—such as extensive caching—follow as a second step.)
Therefore, efficient access to relational data with SQL usually requires joins between the tables of interest. The number of tables included in the join when retrieving data determines the depth of the object network you can navigate in memory. For example, if you need to retrieve a User and aren’t interested in the user’s billing information, you can write this simple query:
select * from USERS u where u.ID = 123
On the other hand, if you need to retrieve a User and then subsequently visit each of the associated BillingDetails instances (let’s say, to list all the user’s bank accounts), you write a different query:
select * from USERS u
left outer join BILLINGDETAILS bd on bd.USER_ID = u.ID where u.ID = 123
As you can see, to use joins efficiently you need to know what portion of the object network you plan to access when you retrieve the initial instance before you start navigating the object network! Careful, though: if you retrieve too much data (probably more than you might need), you’re wasting memory in the application tier. You may also overwhelm the SQL database with huge Cartesian product result sets. Imagine retrieving not only users and bank accounts in one query, but also all orders paid from each bank account, the products in each order, and so on.
Any object persistence solution worth its salt provides functionality for fetching the data of associated instances only when the association is first accessed in Java code. This is known as lazy loading: retrieving data on demand only. This piecemeal style of data access is fundamentally inefficient in the context of an SQL database, because it requires executing one statement for each node or collection of the object network that is accessed. This is the dreaded n+1 selects problem.
This mismatch in the way you access data in Java and in a relational database is perhaps the single most common source of performance problems in Java information systems. Yet although we’ve been blessed with innumerable books and articles advising us to use StringBuffer for string concatenation, avoiding the Cartesian product and n+1 selects problems is still a mystery for many Java programmers. (Admit it: you just thought StringBuilder would be much better than StringBuffer.)
Hibernate provides sophisticated features for efficiently and transparently fetching networks of objects from the database to the application accessing them. We discuss these features in chapter 12.
We now have quite a list of object/relational mismatch problems, and it can be costly (in time and effort) to find solutions, as you may know from experience. Let’s get started with an overview of ORM, the Java Persistence standard, and the Hibernate project.
where is section 4.1
ReplyDelete