Cross Database Data Integration

Cross Database Data Integration
- Database Transactions -
Project Document (v. 1.1)

Dusko Koncaliev koncadu@earlham.edu
Kennedy Mutio mutioke@earlham.edu

Table of Contents

Log
Overview
Functional Specification
Design Specification

Log

Nov. 07, 1998 V. 1.1., revised with specifics about invocation of Transaction Manager, results from Transaction Manager, Error Code contents, Busy Wait, Write Ahead Log, Recovery, and other minor changes.

Nov. 04, 1998 V. 1.0. Original version of the project document. Needs revision.

Overview

At this point in the information age, numerous firms, institutions, and even individuals run into problems of trying to manage their data, which may be stored in different types of databases. They are often faced with a dilemma. They have the choice to transform their old data into a new database in order to be able to use the old data along with new data, or try to find a way of integrating the data from multiple different databases.
With this project, we chose to explore the second option and tackle the various issues that are involved in Cross Database Data Integration and Transaction Processing.
This project document is aimed at presenting the essence of the Cross Database Data Integration project, and doing so in an organized and concise manner. The essence and backbone of the project is to enable one to use multiple databases such as Oracle7, Postgres, Sybase, etc., and be able to store and retrieve data from all of them simultaneously. At this point, the project is only concentrating on the use of Oracle7 and Postgres databases, but once this goal is accomplished, it would not be hard to provide for the use of multiple other databases.

Functional Specification

The primary problems in the domain of this project are:

Being able to connect to multiple databases and maintaining multiple instances in those particular databases.
Storing and retrieving data from those databases in a quick and secure manner by having transactions that have the ACID properties.
Utilizing the Two-Phase commit protocol in order to insure that the transactions have the ACID properties.

Usage Models
Transactions to Oracle and Postgres databases
The user will be able to carry out transactions in the Oracle7 and the Postgres databases. At this point the project concentrates only on DML transactions such as storing or manipulating data from an existing database. The primary transaction not talked about in this document is the selection of data, which does not necessarily need to use the two-phase commit protocol and could be implemented otherwise. The user will be able to send the SQL statements that define what the transaction is.
The user will invoke the Transaction Manager [TM] by sending the TM a string that is composed of information about the databases used and the SQL statements that are to be carried out. The TM will be a server sitting on a specific port and waiting for clients to invoke it. The Busy Wait mechanism will be implemented, which means that the client will have to wait for the server (the TM) to finish its work and report back to the client.
The user (client) will receive results from the TM that clearly state whether the transaction requested was successful or not. The result will be an array of strings which, in the case of success will have only two elements (i.e. OK = 1, result = commit or rollback) and in the case of a failure will have multiple elements (i.e. NOT OK = -1, and reason of failure including the SQL statement that caused the failure). The user will have faith that the transactions are carried out with all the ACID properties, whether the transactions requested were committed or rolled back.

Design Specification

Applications and tools used:

Oracle7
Postgres
SQL
Perl with DBI

Data representation:

MAX_TRANS = 10 (the maximum number of transactions supported at the same time)

MAX_SQL = 8 (the maximum number of SQL statements a transaction can support, i.e. the maximum number of resource managers under a transaction)

trans_array [MAX_TRANS] (an array of pointers which holds the transaction manager; points to the array that holds the resource managers)

res_arary [MAX_SQL] (an array of pointers which hold the resource managers; points to a structure (an array of strings) that holds important information relevant to the resource managers, like $rid, $dbh, $rc, $sql, $res_vote, $shm_key, $child_pid)

db_array[MAX_SQL] (an array that holds the $dbh handles for the different databases at this point we choose to have that in an array of 2 because we know the databases we are supporting.)

$trid (a transaction id = [0..9] for its place in the trans_array + localtime_in_seconds)

$rid (a resource id = first_digit_of_trid + [0..9] for its place in the res_array + localtime_in_seconds)

Flow of instructions:
The transaction manager is invoked with a $string that holds DB-SQL pairs. The DB holds the information about the database where the SQL will be executed. At this point (to avoid string manipulation for getting the database information used by DBI) we will have it hardcoded that when

DB = Postgres then the db_info will include "dbi:Pg:dbname=indstudy"
DB = Oracle then the db_info will include "instance", "user", "password", "Oracle7".

The transaction manager splits the $string into n substrings. If n>8 then we stop processing since our MAX_SQL is 8. Otherwise, the transaction manager generates its $trid, writes ahead to the log that the transaction specified by $trid has started, and it stores the $trid into trans_array.
For all n substrings, the transaction manager gets the DB-SQL pairs, generates a $rid which gets stored in the res_array, and logs that the resource manager $rid has started. If not already existent, a connection to the appropriate database is established and the database handle is stored in the db_array for future use.
At this point, the transaction manager forks a child process:

the parent process stores the $child_pid in res_array, gets shared memory space, and stores shm_key in res_array.
the child process runs the resource manager with $dbh handle and $sql statement.

The resource manager executes the SQL and decides on commit versus rollback. It also stores the $rc handle, the SQL statement, and the res_vote into the res_array. Further, the resource manager writes the $res_vote to shared memory, after which the child exits. In the meantime, the parent process is waiting on the result from the child process by reading shared memory increasing the count for each commit vote, after which it deletes the entry from shared memory.
If the count to commit is the same as the number (n) of resource managers, i.e. SQL statements, then the transaction manager makes the decision to commit, otherwise it decides to rollback. The decision is at this point logged.
For each resource manager, the transaction manager logs that a resource manager is started with the specific decision. After the logging, the transaction manager forks a child process and stores the child_pid in res_array.
The child process invokes the resource manager with $dbh handle, $rc handle, and a decision. The resource manager either commits or rollbacks, depending on the decision, and writes 'done' in the shared memory, after which the child process exits.
In the meantime, parent waits on result from the resource manager, and counts the 'done' statements and shared memory entry is deleted. Once all the resource managers have reported completion, the transaction manager logs a completion record. This marks the end of the transaction, after which we may clean up all the records connected to that transaction.

Below is a diagram of the flow of instructions:

1. TM invoked with $string
2. TM logs - transaction started
3. TM decides on number of resource managers and forks a process for each RM. It logs before it starts each RM and then invokes RMs with SQL to be executed
4. Execution of SQL against the specific database
5. RM decides on commit/rollback
6. After all RMs have voted, TM decides on overall decision and logs it
7. TM invokes each RM with the decision
8. RMs commit/rollback
9. RMs inform TM of work completed
10. After all RMs have completed work, TM logs transaction completion

Recovery process:
The Transaction Manager makes sure that the log is kept up to date and durable at all times in case of a system crash or any other kind of failure. At the point of failure the transaction may be at a state where it is not atomic, which goes against the ACID properties of transactions that we strive to provide. To battle this, a recovery process needs to take place. With the initial startup, the Transaction Manager checks the transaction log to see whether there are any transactions that haven't finished prior to a possible failure. If there is a transaction that hasn't been finished, the Transaction Manager gets all the information about that transaction from the transaction log and makes sure the transaction is executed until finished.

Nov. 07, 1998	V. 1.1., revised with specifics about invocation of Transaction Manager, results from Transaction Manager, Error Code contents, Busy Wait, Write Ahead Log, Recovery, and other minor changes.
Nov. 04, 1998	V. 1.0. Original version of the project document. Needs revision.

MAX_TRANS = 10	(the maximum number of transactions supported at the same time)
MAX_SQL = 8	(the maximum number of SQL statements a transaction can support, i.e. the maximum number of resource managers under a transaction)
trans_array [MAX_TRANS]	(an array of pointers which holds the transaction manager; points to the array that holds the resource managers)
res_arary [MAX_SQL]	(an array of pointers which hold the resource managers; points to a structure (an array of strings) that holds important information relevant to the resource managers, like $rid, $dbh, $rc, $sql, $res_vote, $shm_key, $child_pid)
db_array[MAX_SQL]	(an array that holds the $dbh handles for the different databases at this point we choose to have that in an array of 2 because we know the databases we are supporting.)
$trid	(a transaction id = [0..9] for its place in the trans_array + localtime_in_seconds)
$rid	(a resource id = first_digit_of_trid + [0..9] for its place in the res_array + localtime_in_seconds)