Blog from July, 2012

In addition to partitioning tables to speed things up, we should really think about adding these tables of active streams and process instances.

ActiveStream

This new table should hold all streams that are not in a finalized state.

ActiveProcessInstance

This is the same idea, but for process instances. Maybe this one makes more sense to do first.

I think we could keep simpler indexes, which would keep bookkeeping to a minimum, speeding up transactions in general and reducing row locks. This might also be a good fit if we move from the stored procedure model to a server-client DB model.
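
As a toy sketch of the bookkeeping this implies (pure in-memory Java; the names are made up for illustration, and the real tables would of course live in the database), finalizing a stream just drops it out of the small, hot active table:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory sketch of the ActiveStream idea: the full Stream table keeps
// every row forever, while ActiveStream holds only the ids of streams
// that are not yet finalized. Names here are illustrative, not the schema.
class ActiveStreamSketch {
    final Map<Long, String> streamStatus = new HashMap<>(); // stands in for the Stream table
    final Set<Long> activeStreams = new HashSet<>();        // stands in for the ActiveStream table

    void createStream(long id) {
        streamStatus.put(id, "RUNNING");
        activeStreams.add(id);    // new streams start out active
    }

    void finalizeStream(long id) {
        streamStatus.put(id, "FINALIZED");
        activeStreams.remove(id); // finalized streams leave the active set
    }
}
```

Queries against active work would then only ever touch the small table, which is where the simpler-index and reduced-row-lock benefits would come from.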

I'll need to detail any complications that might arise from this model.

One major thing I want to add is support for a RESTful JSON API for the pipeline and the datacatalog.

Right now, we have the following pipeline packages:

  • Server
  • Client
  • Stored Procedures

... and the same hierarchy for the datacatalog.

With the exception of the stored procedure packages, all of these packages rely on org-srs-datahandling-common. I think it might be useful to move some interfaces for Task, Stream, Process, Dataset, DatasetGroup, DatasetLocation, etc. into srs-datahandling-common: things that might get used by a web application, a web/REST application/API, and the *-client and *-server projects.
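
As a rough sketch of the kind of interfaces that might move into the common package (the interface and method names here are guesses for illustration, not the existing API):

```java
// Hypothetical shared interfaces for srs-datahandling-common, so that the
// web/REST layer and the *-client / *-server projects can all share one
// view of these entities. Real method names may well differ.
interface Stream {
    long getStreamId();
    String getStatus();
}

interface ProcessInstance {
    long getProcessInstanceId();
    String getProcessType(); // e.g. "BATCH" or "SCRIPT"
}
```

The server, client, and any REST layer would then code against these interfaces, with the concrete SQL-backed implementations staying in the *-server projects.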

What?

I'd like to split the ProcessInstance table so that it only has information common to both 'BATCH' and 'SCRIPT' process instances.

This should significantly speed up row retrieval for the many operations where the other information is not needed (especially because of the WorkingDir and LogFile columns) and improve database performance in general.

Things to do:

Make a BatchProcessInstance table with the following columns:

  • ProcessInstance
  • JobSite
  • JobID
  • SubmitDate (Maybe should just be in ProcessInstance even though Jython scriptlets will never use it)
  • CPUSecondsUsed
  • MemoryUsed
  • SwapUsed
  • ExecutionHost
  • WorkingDir
  • LogFile

This should nearly eliminate row migration in the ProcessInstance table, which should really help index performance (especially with partitioning).
We can easily do a left outer join between the two tables to get the same result as the current ProcessInstance table.
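
To illustrate the shape of that split and join, here is a toy in-memory sketch in Java (maps standing in for the two tables; class and field names are made up, and the column set is abbreviated):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the ProcessInstance / BatchProcessInstance split. A SCRIPT
// instance has no row in the batch map, so the "left outer join" simply
// yields null for the batch-only fields in that case.
class SplitTablesSketch {
    static class Core {                  // slim ProcessInstance row
        final long id;
        final String type;
        Core(long id, String type) { this.id = id; this.type = type; }
    }

    static class Batch {                 // BatchProcessInstance row
        final String logFile;
        Batch(String logFile) { this.logFile = logFile; }
    }

    final Map<Long, Core> processInstance = new HashMap<>();
    final Map<Long, Batch> batchProcessInstance = new HashMap<>();

    // Analogue of: SELECT ... FROM ProcessInstance pi
    //   LEFT OUTER JOIN BatchProcessInstance bpi ON pi.id = bpi.id
    String logFileFor(long id) {
        Batch b = batchProcessInstance.get(id);
        return (b == null) ? null : b.logFile;
    }
}
```

The point is that the frequently scanned ProcessInstance rows stay small and fixed-width, while the wide, late-updated columns live off to the side.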

After that, things we should do:

  1. Simplify ProcessInstance class in org.srs.pipeline.server.sql to correspond to the reduced ProcessInstance table
  2. Create a new BatchProcessInstance class that corresponds to the join of ProcessInstance and BatchProcessInstance
  3. We need to deal with row migration/chaining due to updating the log file directory. Options are:
    • PCTFREE should be tuned
    • A temporary table to hold BatchProcessInstance data until we get LogDir, when we get that, insert it into BatchProcessInstance
    • ?

This site has some interesting information: http://www.akadia.com/services/ora_chained_rows.html

Why?

I think we eventually want to explore support for MySQL, and it would be nice to have these things cleaned up in any case.

Steps:

  1. Determine performance-critical calls and calls which may iterate over large result sets. These should be run on the database, whether in a stored procedure or a Java application.
    • Like SPStream and SPProcessInstance methods, along with any helpers for these classes.
  2. Code called by DatabaseUtilities that doesn't fall under step 1 should just be moved into DatabaseUtilities
  3. Clean up both DBU and the stored procedures, separate the SQL from the code, and aim to reduce the LoC in the DBU and stored procedure classes
  4. Eventually move away from Oracle's JPub remote calls to something else (autocommit needs to be off). We still want SPs on the DB server, but we might want to move away from Oracle sometime in the future.
  5. After this, we'll also want to move toward handling more concurrency issues in code rather than in the database, to reduce the likelihood of locks.
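
On point 5, one generic way to handle concurrency in code rather than with row locks is optimistic versioning: read the current state, compute the change, and only write if nothing else changed in between, retrying on conflict. A toy in-memory sketch (this is a standard technique, not existing pipeline code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Compare-and-set retry loop: the in-code analogue of
// "UPDATE ... WHERE version = :expected", retried on conflict,
// instead of holding a database row lock across the operation.
class OptimisticCounter {
    private final AtomicLong value = new AtomicLong(0);

    long increment() {
        while (true) {
            long seen = value.get();               // read the current "row"
            long next = seen + 1;                  // compute the new state
            if (value.compareAndSet(seen, next)) { // write only if unchanged
                return next;                       // success, no lock held
            }
            // someone else won the race; loop and retry with fresh state
        }
    }
}
```

Against a database, the same pattern would mean keeping a version column and retrying the transaction when the conditional update touches zero rows.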

Let's think about moving SQL out of Java in order to make queries easier to read and to allow alternate queries per database backend.

A good first step, I think, would be to use QueryLoader and put the SQL in a properties file. Maybe we could put our own query loader in org-srs-datahandling-common.

http://commons.apache.org/dbutils/apidocs/org/apache/commons/dbutils/QueryLoader.html

Example of use:

sql.properties
sqlstatement=\
select someid, someproperty from sometable where someid = ?

sqlstatementtrue=\
select someid from sometable\
    where someproperty = ?

MyClass.java
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Map;
import org.apache.commons.dbutils.QueryLoader;

public class MyClass {
    final Map queryMap;

    public MyClass() throws IOException {
        queryMap = QueryLoader.instance().load("/org/srs/project/sql.properties");
    }

    void someMethod() {
        Connection connection = null;
        PreparedStatement stmt = null;
        ResultSet rs = null;
        try {
            connection = Utils.getConnection();
            stmt = connection.prepareStatement((String) queryMap.get("sqlstatement"));
            stmt.setLong(1, 1234);
            rs = stmt.executeQuery();
            if (rs.next()) {
                System.out.println(rs.getString(2));
            }
        } catch (Exception e) {
            System.out.println("Some error happened");
        } finally {
            // close resources in reverse order; ignore close failures
            try { if (rs != null) rs.close(); } catch (Exception ignored) {}
            try { if (stmt != null) stmt.close(); } catch (Exception ignored) {}
            try { if (connection != null) connection.close(); } catch (Exception ignored) {}
        }
    }
}

I think I'm going to try to document coding and problems I run into using the blog format on my profile, although maybe a general "Coding Blog" might be better.

We'll see how the labels work, too.

Testing blog post

Testing my first blog post.