Saturday, September 1, 2007

DB2 V8 Error Saga continues...

Our DB2 system was converted from NFM (New function mode) to ENFM (Enhanced New Function Mode) & immediately to CM (Compatibility mode) last week...And like in all other conversions we had problems in this conversion too...This time it was with a load job which abended with a file length mismatch error...

We were unloading a DB2 System parameter CURRENT SQLID in our unload card and this caused our job to bomb in production because the length of the variable in which the SQLID is stored was increased and with the DB2 upgrade the variable definition got upgraded to VARCHAR instead of CHAR(8) before...

This caused the IBM unload utility DSNTIAUL to increase the record length of the Unload file...Thus our load step got abended because junk data was getting uploaded into the table...As per the recommendation by DB2 Technical Services we altered the PARMS Parameter in the run card from PARMS('SQL') to PARMS('SQL,1)

DSN PROGRAM(DSNTIAUL) PLAN(DSNTEP2) PARMS('SQL',1) ---> From PARMS('SQL') to PARMS('SQL,1)...

But it didnt work..our job failed once again in production..Then we did a thorough analysis and found that the CURRENT SQLID in the unload card was the problem...We then modified the query to limit the CURRENT SQLID selection by using the typecast function CHAR(8) thus limiting the data to 8 characters...Our job ran fine with this change..

Ooohhhh! Hopefully this is the last error that we are getting out of this DB2 V8 upgrade as all the three stages have been rolled out and we are in the final mode of V8 Compatibility mode (CM)...

- Suresh Gangadharan

Saturday, March 31, 2007

DB2 Dynamic Cache Switch ON

We keep getting problems because of DB2 V7 to V8 Upgrade...Last week we had a problem wherein our DB2 Admin turned of the Dynamic Cache Switch ON and it caused some problems in our time critical application load jobs...

Last week on Sunday, 4 to 5 of our SELECT and Load jobs were down and there were around 88 jobs that were waiting in ESP to run after these abended jobs to complete the entire schedule by Monday 7:00 AM...

We worked closely with the DB2 Admin group and asked them if there were any change in DB2 in this weekend, since these abended jobs were running smoothly all these days...For this, the DB2 group replied back stating that the DYNAMIC CACHE switch on DB2 in production was turned ON, which made us believe that this could be the problem behind the jobs down...

The job was down with the following error:

---------------------------------------------------------------------------------
Error From SYSOUT:
----------------------
READY ,,
DSN SYSTEM (DSN)
DSN
RUN PROGRAM (DSNTIAUL) PLAN (DSNTIAUL) PARMS ('SQL'
DSN ENDED DUE TO ERROR+

SYSTEM ABEND CODE 0C4 REASON CODE 00000038 PSW 077C20019744FCC6 ASIDS:
HOME,= 004E PRIMARY = 00E1 SECONDARY = 00E1

---------------------------------------------------------------------------------
Sample Job:
------------

//ABNDSTEP EXEC PGM=IKJEFT01,DYNAMNBR=20
//SYSOUT DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SYSTSOUT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSTSIN DD *
DSN SYSTEM (DSN)
RUN PROGRAM (DSNTIAUL) +
PLAN (DSNTIAUL) +
PARMS ('SQL') +
LIBRARY ('SYS3.RUNLIB.LOAD')
/*
//SYSREC00 DD DSN=,DISP=(,CATLG,CATLG),
//UNIT=DASD,SPACE=(CYL,(5,5),RLSE),
//DCB=(RECFM=FB,LRECL=562)
//SYSREC01 DD DSN==,DISP=(,CATLG,CATLG),
//UNIT=DASD,SPACE=(CYL,(5,5),RLSE),
//DCB=(RECFM=FB,LRECL=562)
...
...
...
//SYSREC11 DD DSN==,DISP=(,CATLG,CATLG),
//UNIT=DASD,SPACE=(CYL,(5,5),RLSE),
//DCB=(RECFM=FB,LRECL=562)
//SYSPUNCH DD DUMMY
//*
//****Please note that I have shown one of the query of the 11
//****queries. Each query is the same as the one shown below
//****with just difference in the predicates and they would be
//****unloaded to above 11 SYSRECs in the JCLs...All of them
//****are present in the same control card and each query is
//****separated by a semi colon (;)...The actual query is thus
//****4500 lines...
//*
//SYSIN DD *

SELECT
FISC_YR
,FISC_WK
,CAST ('ABT' AS CHAR(03))
,CAST (9999 AS SMALLINT) AS COL1
,CAST (0 AS SMALLINT) AS COL2
,CAST (COL_A AS SMALLINT) AS COL3
,CAST (9999 AS SMALLINT) AS COL4
...
...

,CAST (COALESCE( STDDEV(DURATION_HRS) / 24 ,0) AS DECIMAL(5,1)) AS ST_DEV
,CAST (0 AS DECIMAL(4,1)) AS TOTAL_90
,CAST (0 AS DECIMAL(4,1)) AS TOTAL_98

FROM
(
SELECT FISC_YR,FISC_WK,COL_A,COL_B,COL_C FROM
A
INNER JOIN
B
ON A.COL1 = B.COL1

WHERE A.TYPE = 'D'
AND B.COL_5 = 'ST'

UNION ALL

SELECT FISC_YR,FISC_WK,COL_A,COL_B,COL_C FROM
A
INNER JOIN
B
ON A.COL1 = B.COL1

WHERE
A.TYPE = 'D'
AND B.COL_5 = 'ST'
AND A.STATE = 'TX'

) X

GROUP BY
FISC_YR
, FISC_WK
, X.COL_A
, X.COL_B
, COL3

UNION

SELECT
FISC_YR
,FISC_WK
,CAST ('ABT' AS CHAR(03))
,CAST (9999 AS SMALLINT) AS COL1
,CAST (0 AS SMALLINT) AS COL2
,CAST (COL_A AS SMALLINT) AS COL3
,CAST (9999 AS SMALLINT) AS COL4
...
...
,CAST (COALESCE( STDDEV(DURATION_HRS) / 24 ,0) AS DECIMAL(5,1)) AS ST_DEV
,CAST (0 AS DECIMAL(4,1)) AS TOTAL_90
,CAST (0 AS DECIMAL(4,1)) AS TOTAL_98
FROM
(
SELECT FISC_YR,FISC_WK,COL_A,COL_B,COL_C FROM
A
INNER JOIN
B
ON A.COL1 = B.COL1

WHERE A.TYPE = 'D'
AND B.COL_5 = 'ST'

UNION ALL

SELECT
FISC_YR
,FISC_WK
,COL_A
,COL_B
,COL_C

FROM
A

INNER JOIN
B

ON A.COL1 = B.COL1

WHERE
A.TYPE = 'D'
AND B.COL_5 = 'ST'
AND A.STATE = 'TX'

) X

GROUP BY
FISC_YR
, FISC_WK
, COL4
, X.COL_A
, X.COL_B
, COL3 ;

/*
//*
//*--------------------------------------------------------------------------*
//* ABNDSTP1 STEP WILL ABEND IF ABOVE STEP HAD CODE CODE > 4 *
//*--------------------------------------------------------------------------*
// IF ABNDSTEP.RC GT 4 THEN
//ABNDSTP1 EXEC PGM=ABENDER,PARM=020
// ENDIF
//*
---------------------------------------------------------------------------------
The problem in the above query was the STDDEV function which was trying to refer a column in the inner query...Without the STDDEV the job ran fine...But we couldnt jeopardize on the results by removing the STDDEV function and replacing it with some other function....

Hence we requested DYNAMIC CACHING to be turned OFF in production to the DB2 group...They did and we went ahead and restarted the jobs...Everything ran fine after that...We also completed the entire schedule within 6:00 AM Monday well within the SLA time of Monday 7 AM...


Further the DB2 group has opened up a ticket with IBM for a patch that could be applied to make sure this query works out when the DYNAMIC CACHE is turned ON again in production because according to them the DYNAMIC CACHE turn ON would result in significant performance improvement....

Apparently we were able to replicate the Error in Test environment and DB2 group is taking the abend dumps from the test and sending it to IBM...In the following days, we would try to apply the patch first in Test and run our jobs...If they run fine over in Test with the DYNAMIC CACHE switch Turned ON, then they would apply the patch in production too and turn on the DYNAMIC CACHE switch in production too...

- Suresh Gangadharan

Tuesday, February 27, 2007

Restart of Load Replace Job

One of our purge job for a huge volume table abended yesterday for space abend (SB37) on the SYSMAP dataset. The allocation was CYL(100,100) but due to high volume of data the allocation was not enough on the SYSMAP dataset and the job went down.

I checked up the tablespace and Indexspace...They were put up in the RW,RECP,UTUT and RW,REBP,UTUT mode respectively...The job was loading the data to the table after filtering out the data with the purge conditions...It was doing a refresh of the table data with RESUME NO REPLACE...I wasn't sure of how to restart the job whether a phased restart - resuming it from the abended phase or terminating the active utility and restarting from the load step...

After referring to some IBM sites, I went ahead and restarted the job from the load step by fixing the JCL for more allocation SPACE=(DASD,5) after terminating the active utility id using
-TERM UTIL(active util ID)...The job went fine without any issues...

- Suresh Gangadharan

Wednesday, December 6, 2006

DB2 V8 Upgrade

Here is another problem which is still unresolved and it is haunting us for the past two days...Wonder! where in the world DB2 gets such strange problems like these...more it poses these kinda of problems more interesting it is...

Problem:

Our DB2 system was upgraded to Version 8...All our jobs were running fine these days until yesterday we noticed a problem with an Online app which accesses our database...Whenever we try to run the query through this Online app we are getting the following TCP/IP communication protocol error...We tried running the DB2 query in SPUFI which abended with the following hex abend...

Abend 0C4000 hex occurred processing command 'DSN '.,
***,


We checked out the DSNMSTR and we could repeatedly see the following error piling up in the DSNMSTR whenever we try to run the query against the DB2 databases.

- SERVER DISTRIBUTED AGENT WITH 485 LUWID=GA6FD43E.BF95.02BE01202707=33439
THREAD-INFO=userid:server:userid:dllhost.exe
RECEIVED ABEND=0C4
FOR REASON=00000000
- GA6FD43E.BF95.02BE01202707=33439 486
ACCESSING DATA FOR
LOCATION ipaddress
IPADDR ipaddress

Initially we thought that there is something wrong with the query...But the same query worked fine for a different search criteria which made us think that the DB2 objects used by the query might have some possible data corruption...Hence we did a CHECK PAGE on all the Tablespaces to see if any page of the tablespace is corrupted or not by running the Image copy jobs with CHECK PAGE option...But they didnt throw any error...We even tried out CHECK DATA and CHECK INDEX utilities against the Tablespaces/Indexspaces but no use...Still the page was abending with the communication protocol error pasted below....


Error Number: -2147168228
Error Description: SQL30081N
A communication error has been detected. Communication protocol being used: "TCP/IP". Communication API being used: "SOCKETS". Location where the error was detected: "ipaddress".
Communication function detecting the error: "recv". Protocol specific error code(s): "*", "*", "0". SQLSTATE=08001
Error Source: IBM OLE DB Provider for DB2


Query:

(Select
VT.COL1
,VT.COL2
,rtrim(ltrim(VT.COL3))
,Case(when VT.COL4 is null
then ' '
else VT.COL4
end)
from
(
Select
COL1
,COL2
,COL3
,COL4
from

T1 inner join T2
on T1.col1 = T2.col1
and T1.col2 = T2.col2
and T1.col1 = 1729

... where ..
)VT
)

Seeing the query we feel that it is having a problem during the sort phases in the buffer since when we run the inner query the response time is fast and no errors are observed. Also the additional predicate to the inner query works which is even more strange since both of the inner queries for the specific example that fails will never return and rows that would require the sort/merge to the outer portion of the SQL.

Solution:

Still the problem persists...Donno whats going on in the background of DB2...We have opened a ticket with IBM and concluded this problem as DB2 V8 problem as doing a search in the web...

http://www.dbtalk.net/ibm-software-db2-mvs/informatica-problem-after-db2-version-280956.html

Still our app gets this error...Thank God its a less frequently used Online App and users hardly give it a try which is welcome news for us...But surely this is haunting us and we are waiting on IBM for the fix....Once i have the fix from IBM i will post the solution to this problem by editing the article...

If anyone of u reading this article have faced similar problem while working in DB2 please post the comments on how you tackled it....

- Suresh Gangadharan

Monday, December 4, 2006

LOCKSIZE Parameter

Problem:

Our System has a segmented tablespace against which is accessed by multiple plans running at the same time. The concurrent plans try to insert and update the database at the same time...This causes lot of contention and occassionally our plans go down with -911 (deadlock condition)

Possible Solutions:

1) We could have the tablespace change to a Partitioned one to increase the parallelism.
2) Introduce -911 retry logic to have the plan wait for specified seconds and then try to access the tablespace.
3) Check the LOCKSIZE parameter of the tablespace definition...This parameter decides the degree of locks to be taken on tablespace be it page/row/table/Any...For better concurrency for random processing use LOCKSIZE = ROW...

We applied option 2 and 3 and all our concurrent plans are running better than before the change...

- Suresh Gangadharan

DB2 problem postings

This blog will primarily focus on the issues that I face in my work and also the solution for the problems that I fixed...Obviously, this would be updated only when I get a problem working in DB2...I pray to God that this one be updated the least, which means my work life is smooth :-)

One more thing is that I try to post up the situation and the scenario too along with each and every problem...So readers are advised to attempt the solution posted only if the scenario matches with theirs...Also the solutions posted in this forum are practically applied in my work...

- Suresh Gangadharan