[UPDATED] Pagination on DB results

Recently at work I faced this classic problem:

The dataset in the DB is too large to be fetched by your Java app without an OutOfMemoryError.

This is where pagination comes into play: load the results page by page.

Of course, with a complicated WHERE clause it is slower.

The solution is clear in principle; the question is the implementation.

I am a lazy developer who is used to pagination APIs like the one Hibernate provides, so I expected the answer to be simple.

Since I am using Oracle 11g with Spring's JdbcTemplate, and the implementation details seemed quite scattered, I have formulated a generalized answer here:

Besides Hibernate pagination, we can use SQL pagination or JDBC pagination.

SQL pagination

There are two basic approaches:

  • operating on a piecemeal result set (a new query for each page)
  • operating on the full result set

The way to do it is SQL-dialect specific.

For MySQL and many other databases it can be done with LIMIT and OFFSET.
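As a rough sketch (my own illustration, not from the original post), a page query in that style with Spring's JdbcTemplate might look like this; the table, columns and helper signature are placeholders:

[sourcecode language="java"]
// Hypothetical helper: fetch one page with LIMIT/OFFSET (MySQL-style syntax).
// Table and column names are placeholders; pageNumber is 0-based.
List<Map<String, Object>> fetchPage(JdbcTemplate jdbcTemplate, int pageNumber, int pageSize) {
    String sql = "SELECT id, name FROM employee ORDER BY id LIMIT ? OFFSET ?";
    return jdbcTemplate.queryForList(sql, pageSize, pageNumber * pageSize);
}
[/sourcecode]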

Postgresql: http://microjet.ath.cx/WebWiki/ResultPaginationWithPostgresql.html

In Oracle, it uses the same form as a “Top-N query” (e.g. who are the 5 highest paid employees), which is optimized:

[sourcecode language="sql"]
select *
  from ( select a.*, rownum rnum
           from ( YOUR_QUERY_GOES_HERE -- including the order by ) a
          where rownum <= MAX_ROWS )
 where rnum >= MIN_ROWS
[/sourcecode]

Here is a very detailed explanation of ROWNUM.

Similar SO Thread
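For completeness, here is a sketch of how that ROWNUM wrapper could be used from Spring's JdbcTemplate (my own illustration; the inner query, the row mapper and the bounds are placeholders, not a definitive implementation):

[sourcecode language="java"]
// Hypothetical helper: fetch rows minRow..maxRow (1-based, inclusive) with the
// ROWNUM wrapper above. The inner query and EmployeeRowMapper are placeholders.
List<Employee> fetchPage(JdbcTemplate jdbcTemplate, int minRow, int maxRow) {
    String sql =
          "select * from ( select a.*, rownum rnum "
        + "  from ( select id, name from employee order by id ) a "
        + "  where rownum <= ? ) "
        + "where rnum >= ?";
    return jdbcTemplate.query(sql, new EmployeeRowMapper(), maxRow, minRow);
}
[/sourcecode]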

JDBC pagination

The question that comes to mind is: when I execute the SQL, how is the result loaded? Immediately or on request? (same as this SO thread)

First we need to understand some JDBC basics, as documented by Oracle.

Per the javadoc for Statement.execute():

execute: Returns true if the first object that the query returns is a ResultSet object. Use this method if the query could return one or more ResultSet objects. Retrieve the ResultSet objects returned from the query by repeatedly calling Statement.getResultSet.

We access data in a ResultSet via a cursor. Note that this cursor is different from a DB cursor; it is a pointer initially positioned before the first row of data.

The data is fetched on request; calling execute() performs the first fetch.

Then, how much data is loaded at a time? It is configurable.

One can use the setFetchSize() method on Statement or ResultSet to control how many rows the driver fetches from the DB at a time, i.e. how big the blocks it retrieves at once are.

For example, assume the total result is 1000 rows. If the fetch size is 100, fetching the 1st row will load 100 rows from the DB, and the 2nd to 100th rows will be read from local memory. Querying the 101st row loads another 100 rows into memory.
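To make that fetching behaviour concrete, here is a minimal plain-JDBC sketch (my own illustration; the query and the connection handling are simplified placeholders):

[sourcecode language="java"]
// Hypothetical sketch: with a fetch size of 100, roughly 10 round trips serve 1000 rows.
void streamAll(Connection connection) throws SQLException {
    try (PreparedStatement ps = connection.prepareStatement(
            "select id, name from employee order by id")) {
        ps.setFetchSize(100); // hint: pull 100 rows per round trip
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // rows 2-100 are served from the driver's local buffer;
                // reading row 101 triggers the next block fetch
                System.out.println(rs.getLong("id") + " " + rs.getString("name"));
            }
        }
    }
}
[/sourcecode]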

From JavaDoc

Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed for ResultSet objects generated by this Statement. If the value specified is zero, then the hint is ignored. The default value is zero.

Note the word “hint”: it can be overridden by the driver-specific implementation.

This is also what the “Limit Rows to 100” feature in clients like SQL Developer is based on.

To complete the whole solution and scroll through the results, one also needs to consider the ResultSet types and the scrollable cursor APIs.
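Before the TopLink example below, here is roughly what the plain-JDBC flavour of that idea looks like: a scroll-insensitive ResultSet positioned with absolute() to jump to a page. This is a sketch only; how efficiently the driver handles it is driver-specific, and the query is a placeholder.

[sourcecode language="java"]
// Hypothetical sketch: run one query, then jump to the requested page with a scrollable cursor.
void showPage(Connection connection, int pageNumber, int pageSize) throws SQLException {
    try (Statement stmt = connection.createStatement(
            ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY)) {
        stmt.setFetchSize(pageSize);
        try (ResultSet rs = stmt.executeQuery("select id, name from employee order by id")) {
            int offset = (pageNumber - 1) * pageSize; // pageNumber is 1-based
            if (offset > 0) {
                rs.absolute(offset); // land on the last row of the previous page
            }
            int shown = 0;
            while (shown < pageSize && rs.next()) {
                System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                shown++;
            }
        }
    }
}
[/sourcecode]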

One can find an example implementation in this post on the Oracle forums, which is from the book Oracle TopLink Developer's Guide:

Example 112 JDBC Driver Fetch Size


[sourcecode language="java"]
ReadAllQuery query = new ReadAllQuery();
query.setReferenceClass(Employee.class);
query.setSelectionCriteria(new ExpressionBuilder().get("id").greaterThan(100));

// Set the JDBC fetch size
query.setFetchSize(50);

// Configure the query to return results as a ScrollableCursor
query.useScrollableCursor();

// Execute the query
ScrollableCursor cursor = (ScrollableCursor) session.executeQuery(query);

// Iterate over the results
while (cursor.hasNext()) {
    System.out.println(cursor.next().toString());
}
cursor.close();
[/sourcecode]


After all, the question boils down to:

Which is the better way to do pagination?

Note that the SQL must have an ORDER BY for the SQL approach to make sense; otherwise some rows may appear again on the next page.

It seems they are not mutually exclusive. When one uses JDBC to execute paging SQL, one also needs to consider the fetch size. (I wonder: is setting the fetch size equal to the page size optimal, or does that make the SQL pagination meaningless?)

Below are some points from PostgreSQL's documentation on the JDBC driver and other SO answers.

The major difference is that the JDBC way requires holding the connection open during fetching. This may not be suitable for a stateless web application, for example.

For the SQL way:

  • the syntax is database-specific and may not be easy to maintain.

For the JDBC way (a sketch follows this list):

  • The connection to the server must be using the V3 protocol. This is the default for (and is only supported by) server versions 7.4 and later.
  • The Connection must not be in autocommit mode. The backend closes cursors at the end of transactions, so in autocommit mode the backend will have closed the cursor before anything can be fetched from it.
  • The Statement must be created with a ResultSet type of ResultSet.TYPE_FORWARD_ONLY. This is the default, so no code will need to be rewritten to take advantage of this, but it also means that you cannot scroll backwards or otherwise jump around in the ResultSet.
  • The query given must be a single statement, not multiple statements strung together with semicolons.
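Putting those requirements together, cursor-mode fetching against PostgreSQL could look roughly like the sketch below (my own illustration under the assumptions listed above; the query is a placeholder):

[sourcecode language="java"]
// Hypothetical sketch: cursor-based fetching with the PostgreSQL JDBC driver.
void streamWithCursor(Connection connection) throws SQLException {
    connection.setAutoCommit(false); // required, or the backend closes the cursor
    try (Statement stmt = connection.createStatement()) { // TYPE_FORWARD_ONLY by default
        stmt.setFetchSize(50); // rows now come from the server 50 at a time
        try (ResultSet rs = stmt.executeQuery("select id, name from employee")) {
            while (rs.next()) {
                System.out.println(rs.getLong("id") + " " + rs.getString("name"));
            }
        }
    }
    connection.commit();
}
[/sourcecode]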

 

[UPDATED] Memory consumption for FetchSize

Some points to note after production testing: to my surprise, I found that the JDBC driver allocates memory according to the fetch size. It creates lots of objects no matter whether that many records exist or not, e.g. setFetchSize(10000) will pre-create 10000 objects as required by the schema (more memory if the DB fields are longer).

Thus it is advised either to predetermine the record count with a COUNT SQL, or to increase the fetch size in a greedy manner (not possible with the method queryForList, but possible if you use Spring's queryForRowSet; see the related SO thread). You check whether the number of rows read so far exceeds a threshold, then increase the fetch size; finer multi-level control is also possible. The trade-off here is between the threshold and the number of round trips needed (a single large paging size is fastest).

e.g. do a check like this:

[sourcecode language="java"]
int level = 0;
int rowNum = 0;
while (resultSet.next()) {
    rowNum++;
    if (level == 0 && rowNum > level1Threshold) {
        // enough rows seen: ask the driver for bigger blocks from now on
        resultSet.setFetchSize(level2Threshold);
        level = 1;
    } else if (level == 1 && rowNum > level2Threshold) {
        resultSet.setFetchSize(maxPagingSize);
        level = 2;
    }
    // do sth with the current row
}
[/sourcecode]

More optimizations tested

Using Spring's result set wrapper ResultSetWrappingSqlRowSet may be faster and easier, since queryForList creates objects both in JDBC and in the List you get, while this approach lets you control the cursor while looping through the result and create objects on demand for the list you build (the JDBC-side memory hit still applies).

The speed improvement from setFetchSize is, of course, more relevant when actually materializing objects, not when merely looping over the ResultSetWrappingSqlRowSet,

i.e. it is faster for

    while (resultSet.next()) {
        results.add(this.rowMapper.mapRow(this.resultSet.getResultSet(), rowNum++));
    }

but not

    while (resultSet.next()) {
        // do nothing
    }
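For reference, here is a minimal sketch of that queryForRowSet approach (my own illustration; the query, the Employee class and its constructor are placeholders):

[sourcecode language="java"]
// Hypothetical sketch: loop a SqlRowSet and map rows into objects yourself,
// instead of letting queryForList build a Map per row up front.
List<Employee> loadAll(JdbcTemplate jdbcTemplate) {
    jdbcTemplate.setFetchSize(100); // still worth tuning: the JDBC-side memory hit applies
    SqlRowSet rowSet = jdbcTemplate.queryForRowSet("select id, name from employee order by id");
    List<Employee> results = new ArrayList<Employee>();
    while (rowSet.next()) {
        results.add(new Employee(rowSet.getLong("id"), rowSet.getString("name")));
    }
    return results;
}
[/sourcecode]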


Some Further reading

This post is about performance tuning with optimal fetch size:

http://webmoli.com/2009/02/01/jdbc-performance-tuning-with-optimal-fetch-size/

http://blog.lishman.com/2008/03/jdbc-fetch-size.html

Salute, Evernote

I have been using Evernote for a long time, and more and more lately.

As I continue to work on my file-organizing research project (still rushing before the deadline), I finally found out that what I have been using to do the work is actually quite similar to what I want to do.

Indeed, after all, organizing files and notes share the same objectives, at least from a human's logical view. What we need to organize is a piece of information, and the way we do so should be simple and fast enough, which is the reason I started the project.

Although I am not a big fan of Evernote as a user (I am a big fan of OneNote, and that's why this post is not an advertisement), after studying its internals I found it quite impressive.

Familiar words and jargon appear: Sharding, Job Node, WebDAV, Lucene… They have faced many problems and handled them pretty well. Quite a lot for me to learn.

Some impressive things I noted:

Internals

As described in the post “A Digest of Evernote’s Architecture”, the basic stack of Evernote is

Debian + Java 6 + Tomcat + Hibernate + Ehcache + Stripes + GWT + MySQL (for metadata) + hierarchical local file systems (for file data).

To my surprise, MySQL is being used (once upon a time a famous blogger said it is the only DB with data integrity issues, source: forgotten!). Anyway, it is not a DB I am familiar with, so I cannot comment.

And for the question “Why SQL (at all)?”, you can find the official answer in this post.

To sum up, it is the ACID properties, which ease their awesome synchronization model.

Sharding

It is meaningless to create a data storage application if it cannot scale. Evernote does this amazingly. It leverages its natural advantage: basically there is not much data shared among users (ignoring shared notebooks). Every user's data is localized in one shard, replicated across 2 VMs using DRBD. Each shard handles both data and application logic for 100,000 users. This model scales horizontally pretty nicely. Here sharding is not a buzzword like what echoes in some companies' meeting rooms.

To keep these servers manageable, automated deployment is of course doing its part well.

This is how they improved the model recently, where one point of improvement targets the heavy IO produced by the Lucene indexing. The latest model introduces a metadata+app shard using SSDs.
Cost per user dropped from $0.10 (hardware) and 3.7mW (power) to $0.05 and 1.4mW.

The searching capability

Perhaps the most crucial part. When you create a new note, its content is immediately searchable. This is extremely useful: besides normal searching, this speed is a feature that saves the day when I am organizing notes created now and then and trying to refer back to the note I am working on. And searching is not limited to plain text; it also covers PDFs and images.

There is unbelievable state-of-the-art technology and dedicated infrastructure in use for image processing, even for free accounts.

For some details, read here about the indexing system, especially the AIR servers which handle image processing, and even more details.

Successful Freemium Model

Compared to many nice and popular websites with a yet-to-be-proven business model (remember the ad-free YouTube?), Evernote's freemium model is among the most successful ones (Dropbox comes to mind, of course).

Evernote IS making money. The numbers speak for themselves:

34M users, 1.4M paying users. $30 million in revenue, tripling annually. $1B valuation, IPO on the todo list (2012).

Sharing the technical details

Not every company, especially closed-source ones, wants to share their technical details like Evernote does. Sometimes you will read an article like Facebook's on its self-healing engine, but Evernote's tech blog is really detailed and nicely written. Some companies I know have internal documentation worse than these blogs.

It is a culture of sharing: let people criticize the design and tell them what they want. This is what I am most thankful for, and how I am able to write this blog post.

This post is about the tech and business side. I skipped lots of details (though they are in my Evernote notebook). Let's see if I have time and interesting points to discuss later.

I am going to write another post on my experience with it and on smart tools to use with it.

Tika in action

Finally finished the book <<Tika in Action>>,

a free read thanks to my friend west

I don't need to master Tika yet, but the book serves as a nice window into the world of search and the Apache projects in this area.

It covers a lot of topics, though not in great detail for each of them.

To put it simply, Tika is a parser, and a powerful one.

Spawned from the search engine project Lucene, it is specialized for searching use cases.

It is capable of detecting file types, encodings and even languages out of the box.

Behind a uniform parse API, it uses various parser libraries, like PDFBox for PDF, to extract and analyze documents, capturing any metadata.

This information can easily be fed into a search engine indexer.
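As a small illustration of that uniform API (my own sketch, not from the book; the input file is a placeholder), the Tika facade detects the type and extracts text in a few lines:

[sourcecode language="java"]
import java.io.File;
import org.apache.tika.Tika;

public class TikaSketch {
    public static void main(String[] args) throws Exception {
        File doc = new File("report.pdf"); // placeholder input file
        Tika tika = new Tika();

        System.out.println(tika.detect(doc));        // e.g. "application/pdf"
        System.out.println(tika.parseToString(doc)); // delegates to PDFBox for PDF
    }
}
[/sourcecode]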

It takes care of the special handling such a parser component needs: incremental extraction, type/encoding/language detection, optimizing to use random access for metadata before parsing, doing in-memory access only when needed, and more.

Currently it forms a lower-level component in the Apache search ecosystem, as the book puts it.

For this, it also supports being modularized as an OSGi bundle.

Apache search ecosystem

In the ecosystem, there are numerous Apache projects that link to each other, which I will explore.

For the typical search engine use case, one can apply Tika, Lucene and Solr to do faceted searching and host it in a web server; or use Nutch, an Apache version of Google, which uses Tika together with Hadoop to index with MapReduce, and then Gora for data storage in a BigTable-like store such as Apache Accumulo.

I also discovered Jackrabbit, which does content repository management and is the relevant existing project in this space.

Mahout, for machine learning algorithms, also seems useful for my research.

Another thing I benefited from is that the book also introduces some metadata models like Dublin Core, general file-handling concepts, language detection algorithms, etc. This helps me understand the field I am researching and gives me a better idea of what my topic actually is.

I made many notes, but I don't want to copy the book here. I will try to share them together with my usage in my project.

The book always follows an introduction-content-summary structure, which is good for memory but makes the actual content relatively superficial and short after the boilerplate.
IMO this is not great if you are skimming through for insights: sometimes I wanted to skip ahead but found I had almost missed some important concepts.