Blog from July, 2012

In addition to partitioning tables to speed things up, we should really think about adding these tables of active streams and process instances.

ActiveStream

This new table should hold all streams that are not in a finalized state.

ActiveProcessInstance

This is the same idea, but for process instances. Maybe this one makes more sense to do first.

I think we could keep simpler indexes, which would keep bookkeeping to a minimum, speeding up transactions in general and reducing row locks. This might also be a good fit if we move from the stored procedure model to a server-client DB model.
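
As a toy sketch of the bookkeeping this implies (pure in-memory Java; the names are made up for illustration, and the real tables would of course live in the database), finalizing a stream just drops it out of the small, hot active table:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory sketch of the ActiveStream idea: the full Stream table keeps
// every row forever, while ActiveStream holds only the ids of streams
// that are not yet finalized. Names here are illustrative, not the schema.
class ActiveStreamSketch {
    final Map<Long, String> streamStatus = new HashMap<>(); // stands in for the Stream table
    final Set<Long> activeStreams = new HashSet<>();        // stands in for the ActiveStream table

    void createStream(long id) {
        streamStatus.put(id, "RUNNING");
        activeStreams.add(id);    // new streams start out active
    }

    void finalizeStream(long id) {
        streamStatus.put(id, "FINALIZED");
        activeStreams.remove(id); // finalized streams leave the active set
    }
}
```

Queries against active work would then only ever touch the small table, which is where the simpler-index and reduced-row-lock benefits would come from.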

I'll need to detail any complications that might arise from this model.

One major thing I want to add is support for a RESTful JSON API for the pipeline and the datacatalog.

Right now, we have the following pipeline packages:

  • Server
  • Client
  • Stored Procedures

... and the same hierarchy for the datacatalog.

With the exception of the stored procedure packages, all of these packages rely on org-srs-datahandling-common. I think it might be useful to move some interfaces for Task, Stream, Process, Dataset, DatasetGroup, DatasetLocation, etc. into srs-datahandling-common: things that might get used by a web application, a web/REST application/API, and the *-client and *-server projects.
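
As a rough sketch of the kind of interfaces that might move into the common package (the interface and method names here are guesses for illustration, not the existing API):

```java
// Hypothetical shared interfaces for srs-datahandling-common, so that the
// web/REST layer and the *-client / *-server projects can all share one
// view of these entities. Real method names may well differ.
interface Stream {
    long getStreamId();
    String getStatus();
}

interface ProcessInstance {
    long getProcessInstanceId();
    String getProcessType(); // e.g. "BATCH" or "SCRIPT"
}
```

The server, client, and any REST layer would then code against these interfaces, with the concrete SQL-backed implementations staying in the *-server projects.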

What?

I'd like to split the ProcessInstance table so that it only has information common to both 'BATCH' and 'SCRIPT' process instances.

This should significantly speed up row retrieval for the many operations where the other information is not needed (especially because of the WorkingDir and LogFile columns) and improve database performance in general.

Things to do:

Make a BatchProcessInstance table with the following columns:

  • ProcessInstance
  • JobSite
  • JobID
  • SubmitDate (Maybe should just be in ProcessInstance even though Jython scriptlets will never use it)
  • CPUSecondsUsed
  • MemoryUsed
  • SwapUsed
  • ExecutionHost
  • WorkingDir
  • LogFile

This should nearly eliminate row migration in the ProcessInstance table, which should really help index performance (especially with partitioning).
We can easily do a left outer join between the two tables to get the same result as the current ProcessInstance table.
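
To illustrate the shape of that split and join, here is a toy in-memory sketch in Java (maps standing in for the two tables; class and field names are made up, and the column set is abbreviated):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the ProcessInstance / BatchProcessInstance split. A SCRIPT
// instance has no row in the batch map, so the "left outer join" simply
// yields null for the batch-only fields in that case.
class SplitTablesSketch {
    static class Core {                  // slim ProcessInstance row
        final long id;
        final String type;
        Core(long id, String type) { this.id = id; this.type = type; }
    }

    static class Batch {                 // BatchProcessInstance row
        final String logFile;
        Batch(String logFile) { this.logFile = logFile; }
    }

    final Map<Long, Core> processInstance = new HashMap<>();
    final Map<Long, Batch> batchProcessInstance = new HashMap<>();

    // Analogue of: SELECT ... FROM ProcessInstance pi
    //   LEFT OUTER JOIN BatchProcessInstance bpi ON pi.id = bpi.id
    String logFileFor(long id) {
        Batch b = batchProcessInstance.get(id);
        return (b == null) ? null : b.logFile;
    }
}
```

The point is that the frequently scanned ProcessInstance rows stay small and fixed-width, while the wide, late-updated columns live off to the side.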

After that, things we should do:

  1. Simplify ProcessInstance class in org.srs.pipeline.server.sql to correspond to the reduced ProcessInstance table
  2. Create a new BatchProcessInstance class that corresponds to the join of ProcessInstance and BatchProcessInstance
  3. We need to deal with row migration/chaining due to updating the log file directory. Options are:
    • PCTFREE should be tuned
    • A temporary table to hold BatchProcessInstance data until we get LogDir, when we get that, insert it into BatchProcessInstance
    • ?

This site has some interesting information: http://www.akadia.com/services/ora_chained_rows.html

Why?

I think we eventually want to explore support for MySQL, and it would be nice to have these things cleaned up in any case.

Steps:

  1. Determine performance-critical calls and calls which may iterate over large result sets. These should be run on the database, whether in a stored procedure or a Java application.
    • Like SPStream and SPProcessInstance methods, along with any helpers for these classes.
  2. Code called by DatabaseUtilities that doesn't fall under step 1 should just be moved into DatabaseUtilities
  3. Clean up both DBU and the stored procedures, separate the SQL from the code, and aim to reduce the LoC in the DBU and stored procedure classes
  4. Eventually move away from Oracle's JPub remote calls to something else (autocommit needs to be off). We still want SPs on the DB server, but we might want to move away from Oracle sometime in the future.
  5. After this, we'll also want to move toward handling more concurrency issues in code rather than in the database, to reduce the likelihood of locks.
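
On point 5, one generic way to handle concurrency in code rather than with row locks is optimistic versioning: read the current state, compute the change, and only write if nothing else changed in between, retrying on conflict. A toy in-memory sketch (this is a standard technique, not existing pipeline code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Compare-and-set retry loop: the in-code analogue of
// "UPDATE ... WHERE version = :expected", retried on conflict,
// instead of holding a database row lock across the operation.
class OptimisticCounter {
    private final AtomicLong value = new AtomicLong(0);

    long increment() {
        while (true) {
            long seen = value.get();               // read the current "row"
            long next = seen + 1;                  // compute the new state
            if (value.compareAndSet(seen, next)) { // write only if unchanged
                return next;                       // success, no lock held
            }
            // someone else won the race; loop and retry with fresh state
        }
    }
}
```

Against a database, the same pattern would mean keeping a version column and retrying the transaction when the conditional update touches zero rows.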

Let's think about moving SQL out of Java in order to make queries easier to read and to allow alternate queries per database backend.

A good first step, I think, would be to use QueryLoader and put the SQL in a properties file. Maybe we could put our own query loader in org-srs-datahandling-common.

http://commons.apache.org/dbutils/apidocs/org/apache/commons/dbutils/QueryLoader.html

Example of use:

sql.properties
sqlstatement=\
select someid, someproperty from sometable where someid = ?

sqlstatementtrue=\
select someid from sometable\
    where someproperty = ?

MyClass.java
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Map;
import org.apache.commons.dbutils.QueryLoader;

public class MyClass {
    final Map queryMap;

    public MyClass() throws IOException {
        queryMap = QueryLoader.instance().load("/org/srs/project/sql.properties");
    }

    void someMethod() {
        Connection connection = null;
        PreparedStatement stmt = null;
        ResultSet rs = null;
        try {
            connection = Utils.getConnection();
            stmt = connection.prepareStatement((String) queryMap.get("sqlstatement"));
            stmt.setLong(1, 1234);
            rs = stmt.executeQuery();
            if (rs.next()) {
                System.out.println(rs.getString(2));
            }
        } catch (Exception e) {
            System.out.println("Some error happened");
        } finally {
            // close resources in reverse order; ignore close failures
            try { if (rs != null) rs.close(); } catch (Exception ignored) {}
            try { if (stmt != null) stmt.close(); } catch (Exception ignored) {}
            try { if (connection != null) connection.close(); } catch (Exception ignored) {}
        }
    }
}

I think I'm going to try to document coding and problems I run into using the blog format on my profile, although maybe a general "Coding Blog" might be better.

We'll see how the labels work, too.

Testing blog post

Testing my first blog post.