Blog

There's a Safari online e-book available for Java RESTful services

RESTful Java with JAX-RS is available for reading online.

Strategies for long queries and RESTful services

I was thinking of adding a ConcurrentHashMap/AtomicInteger to a given service for long-running GET requests.

The idea and strategy is this (a rough sketch in code follows the list):

  1. For requests we think might take a long time, we start a new asynchronous query. These shouldn't be requests that are performed very often. Each query is assigned a new ID from an AtomicInteger.
  2. When the query executes, we set a timer, maybe 10 seconds. If the query finishes in time, we just return the result like normal.
  3. If the query doesn't finish, however, we create a new AsyncQueryStatus object and add it to the ConcurrentHashMap, keyed by the AtomicInteger value. We then return that status object.
  4. The client can poll a new REST resource, which corresponds to the ConcurrentHashMap, for status. While the query is still executing, the same status will be returned.
  5. When the query finishes, the status in the ConcurrentHashMap is updated and the original result object is added.
  6. The client gets the result on the next poll.
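Here's a minimal sketch of that flow, assuming a JAX-RS resource. LongQueryResource, runSlowQuery(), and the paths are names I made up for illustration, and instead of a dedicated AsyncQueryStatus class this sketch just parks the Future itself in the map:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import java.util.concurrent.atomic.AtomicInteger;
    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.core.Response;

    @Path("/longquery")
    public class LongQueryResource {

        // One counter hands out ids; the map tracks queries still running.
        private static final AtomicInteger NEXT_ID = new AtomicInteger();
        private static final ConcurrentHashMap<Integer, Future<String>> RUNNING =
                new ConcurrentHashMap<Integer, Future<String>>();
        private static final ExecutorService POOL = Executors.newCachedThreadPool();

        @GET
        public Response startQuery() throws Exception {
            final int id = NEXT_ID.incrementAndGet();
            Future<String> result = POOL.submit(new Callable<String>() {
                public String call() throws Exception {
                    return runSlowQuery(); // the potentially slow query
                }
            });
            try {
                // Finished inside the timer? Return the result like normal.
                return Response.ok(result.get(10, TimeUnit.SECONDS)).build();
            } catch (TimeoutException e) {
                // Still running: park it in the map, return a pollable status.
                RUNNING.put(id, result);
                return Response.status(Response.Status.ACCEPTED)
                               .entity("RUNNING:" + id).build();
            }
        }

        @GET
        @Path("/status/{id}")
        public Response poll(@PathParam("id") int id) throws Exception {
            Future<String> result = RUNNING.get(id);
            if (result == null) {
                return Response.status(Response.Status.NOT_FOUND).build();
            }
            if (!result.isDone()) {
                // Same status for as long as the query is still executing.
                return Response.status(Response.Status.ACCEPTED)
                               .entity("RUNNING:" + id).build();
            }
            RUNNING.remove(id);
            return Response.ok(result.get()).build();
        }

        private String runSlowQuery() throws Exception {
            Thread.sleep(30000); // stand-in for the actual long-running query
            return "result";
        }
    }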

Just some thoughts.

Getting your bind variables back

  1. Get a heap dump. 
    • For the pipeline:
      1. Hook up JConsole to the instance
      2. select com.sun.management
      3. select operations
      4. Click on dumpHeap. The dump will be written local to the Java instance you connected to; put in a file name and set the boolean parameter to true.
      5. Transfer that file back to your computer
  2. Open up the heap dump in NetBeans
    • Go to Profile -> Load Heap Dump
    • Select your heap dump
  3. Find your statements with OQL:
    • For prepared statements use:
      select {instance: sql, content: sql.sqlObject.odbc_sql.toString()} from oracle.jdbc.driver.T2CPreparedStatement sql
    • For callable statements (i.e. stored procedures) use:
      select {instance: sql, content: sql.sqlObject.odbc_sql.toString()} from oracle.jdbc.driver.T2CCallableStatement sql
  4. For the prepared statements, take a look at the content. The content should be the SQL you are looking for.
  5. You're going to have to look through the bind variables in the T2C*Statement.  They're usually called lastBound* or boundInt, boundLong, etc...

In addition to partitioning tables to speed things up, we should really think about adding these tables for active streams and process instances.

ActiveStream

This new table should hold all streams that are not in a finalized state.

ActiveProcessInstance

This is the same idea for process instances. Maybe this one makes more sense to do first.

I think we could keep simpler indexes, which would keep bookkeeping to a minimum, speeding up transaction time in general and reducing row locks. On top of that, this might be good if we move from the stored procedure model to a server-client DB model.
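As a rough illustration of the bookkeeping (table, column, and method names here are all hypothetical), the idea is just an insert when a stream is created and a delete once it's finalized:

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Hypothetical upkeep for the proposed ActiveStream table.
    public class ActiveStreamBookkeeping {

        // Called when a stream is created.
        static void track(Connection connection, long streamId) throws Exception {
            PreparedStatement ins = connection.prepareStatement(
                "insert into ActiveStream (Stream) values (?)");
            ins.setLong(1, streamId);
            ins.executeUpdate();
        }

        // Called when the stream reaches a finalized state.
        static void untrack(Connection connection, long streamId) throws Exception {
            PreparedStatement del = connection.prepareStatement(
                "delete from ActiveStream where Stream = ?");
            del.setLong(1, streamId);
            del.executeUpdate();
        }
    }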

I'll need to detail any complications that might arise from this model.

One major thing I want to add is support for a RESTful JSON API for the pipeline and the datacatalog.

Right now, we have the following pipeline packages:

  • Server
  • Client
  • Stored Procedures

... and the same hierarchy for the datacatalog.

With the exception of the stored procedure packages, all of these packages rely on org-srs-datahandling-common.  I think it might be useful to move some interfaces for Task, Stream, Process, Dataset, DatasetGroup, DatasetLocation, etc... into srs-datahandling-common: things that might get used via a web application, a web/REST application/API, and the *-client and *-server projects.
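For instance, a shared Stream interface might look something like this (purely a hypothetical sketch; the real method list would come from the existing classes):

    package org.srs.datahandling.common;

    // Hypothetical shared interface so the web/REST layer and the
    // *-client/*-server projects can code against one type.
    public interface Stream {
        long getStreamId();
        String getTaskName();
        String getStatus();
    }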

What?

I'd like to split the ProcessInstance table so that it only has information common to both 'BATCH' and 'SCRIPT' process instances.

This should significantly speed up row retrieval for a lot of operations where the other information is not needed (especially because of the WorkingDir and LogFile columns) and increase database performance in general.

Things to do:

Make a BatchProcessInstance table with the following columns:

  • ProcessInstance
  • JobSite
  • JobID
  • SubmitDate (Maybe should just be in ProcessInstance even though Jython scriptlets will never use it)
  • CPUSecondsUsed
  • MemoryUsed
  • SwapUsed
  • ExecutionHost
  • WorkingDir
  • LogFile

This should nearly eliminate row migration in the ProcessInstance table, which should really, really help index performance (especially with partitioning).
We can easily do a left outer join between the two tables to get the same thing as the old ProcessInstance table, as in the sketch below.
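Something like this (column names follow the list above; the join key is assumed to be ProcessInstance):

    // Hypothetical query recombining the split tables into the old shape.
    String rejoin =
        "select pi.*, bpi.JobSite, bpi.JobID, bpi.SubmitDate, " +
        "       bpi.CPUSecondsUsed, bpi.MemoryUsed, bpi.SwapUsed, " +
        "       bpi.ExecutionHost, bpi.WorkingDir, bpi.LogFile " +
        "  from ProcessInstance pi " +
        "  left outer join BatchProcessInstance bpi " +
        "    on pi.ProcessInstance = bpi.ProcessInstance";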

After that, things we should do:

  1. Simplify ProcessInstance class in org.srs.pipeline.server.sql to correspond to the reduced ProcessInstance table
  2. Create a new BatchProcessInstance class that corresponds to the join of ProcessInstance and BatchProcessInstance
  3. We need to deal with row migration/chaining due to updating the logfile directory.  Options are:
    • PCTFREE should be tuned (see the sketch after this list)
    • A temporary table to hold BatchProcessInstance data until we get LogDir; when we get that, insert it into BatchProcessInstance
    • ?
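The PCTFREE tuning itself is one statement; 30 below is a made-up number that would have to be checked against real row growth:

    // Hypothetical tuning: leave extra free space in each block so the
    // later LogFile/WorkingDir update fits without migrating the row.
    // (Assumes an open java.sql.Connection named 'connection'.)
    java.sql.Statement stmt = connection.createStatement();
    stmt.execute("alter table BatchProcessInstance pctfree 30");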

This site has some interesting information: http://www.akadia.com/services/ora_chained_rows.html

Why?

I think we want to eventually explore support for MySQL, but it'd also be nice to really have these things cleaned up.

Steps:

  1. Determine performance critical calls and calls which may iterate over large result sets.  These should be run on the database, whether in a stored procedure or a Java application.
    • Like SPStream and SPProcessInstance methods, along with any helpers for these classes.
  2. Code which is called by DatabaseUtilities that doesn't conform to 1 should just be moved to DatabaseUtilities
  3. Clean up both DBU and stored procedures, separate out the SQL from the code, aim to reduce the LoC in DBU and Stored procedure classes
  4. Eventually move away from Oracle's JPub remote calls to something else (autocommit needs to be off); we still want SPs on the DB server, but we might want to move away from Oracle sometime in the future.
  5. After this, we'll also want to move towards handling more concurrency issues in code vs. in the database, to reduce the likelihood of locks I think.

Let's think about moving SQL out of Java in order to make queries easier to read and to allow for alternate queries per database.

A good first step I think would be to use QueryLoader and put the SQL in a properties file.  Maybe we could put our own query loader in org-srs-datahandling-common.

http://commons.apache.org/dbutils/apidocs/org/apache/commons/dbutils/QueryLoader.html

Example of use:

sql.properties
sqlstatement=\
select someid, someproperty from sometable where someid = ?

sqlstatementtrue=\
select someid from sometable\
    where someproperty = ?

MyClass.java
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.Map;
    import org.apache.commons.dbutils.DbUtils;
    import org.apache.commons.dbutils.QueryLoader;

    public class MyClass {
        final Map queryMap;

        public MyClass() throws java.io.IOException {
            // Loads the queries keyed by their property names.
            queryMap = QueryLoader.instance().load("/org/srs/project/sql.properties");
        }

        void someMethod() {
            Connection connection = null;
            PreparedStatement stmt = null;
            ResultSet rs = null;
            try {
                // Utils.getConnection() stands in for however we get a connection.
                connection = Utils.getConnection();
                stmt = connection.prepareStatement((String) queryMap.get("sqlstatement"));
                stmt.setLong(1, 1234);
                rs = stmt.executeQuery();
                if (rs.next()) {
                    System.out.println(rs.getString(2));
                }
            } catch (Exception e) {
                System.out.println("Some error happened");
            } finally {
                // Always release JDBC resources.
                DbUtils.closeQuietly(connection, stmt, rs);
            }
        }
    }

I think I'm going to try to document coding and problems I run into using the blog format on my profile, although maybe a general "Coding Blog" might be better.

We'll see how the labels work, too.

Testing blog post

Testing my first blog post.