Thursday, July 29, 2010

Process hung on library cache lock

Customer reported a system slowdown/process hung during small and well defined period of time and it was related to one specific process only. Standard Oracle 9i performance analyze tool - Statspack didn't show anything unusual in that time but shapshots were taking very 30 min and this issue took about 5 minutes.
As we could not replicate issue I thought about SASH (a free simulator of Oracle Active Session History) and I have installed it and configured. From that time I was able to gather and keep information about active sessions.
When next time I got detailed information about similar issue I was able to use collected data and found a problem. After investigation I figured out that process used for that business functionality stuck on “library cache lock” event. When I compared my findings from SASH with Statspack I realize why I didn't spotted it before - "library cache lock" was about 1.8 % of all locks between statspack snapshots. From overall database review there was no problem but from process (and business) perspective that lock hung important process for almost 5 minutes.

Below I described all major steps of my investigation and what was a root cause.
Let's start with SASH output (I just left 3 lines but there was many more similar lines):
SQL> select sample_time,program, session_id, session_state, event, seq#, sql_id, '0' || trim(to_char(p1,'XXXXXXXXXXXXXXXXX')) "p1raw", p3
  2  from v$active_session_history
  3  where 
  4  sample_time >= to_date('18-7-2010 13:50:10','dd-mm-yyyy hh24:mi:ss')
  5  and sample_time <= to_date('18-7-2010 13:54:58','dd-mm-yyyy hh24:mi:ss')
  6  and event like 'library%'
  7  order by sample_time desc
  8  /

SAMPLE_TI PROGRAM        SESSION_ID SESSION_STATE EVENT              SEQ#  SQL_ID     p1raw             P3
--------- -------------- ---------- ------------- ------------------ ----- -------    -----------       ---
18-JUL-10 importantp.exe        472 WAITING       library cache lock  214  1548901218 07000002AC789A98  210
18-JUL-10 otherprogr.exe        436 WAITING       library cache lock    8  2855865705 07000002AC789A98  210
18-JUL-10                       375 WAITING       library cache lock   33   261624711 07000002AC789A98  310

First idea was about parsing problem but when I took a look on SQL it looked OK but it was just entry point to PL/SQL procedure. Next thing was to check event parameters and there was first surprise. According to Oracle documentation and Metalink note 34579.1 that event has following parameters:
  • p1 is an address of required object - in my case 07000002AC789A98
  • p3 is a lock type and namespace type (lock mode * 100 + namespace)  - in my case 310 or 210
In Oracle definitions (metalink or decode in v$librarycache) namespace number 10 is not defined. See whole list below
•    0 SQL Area
•    1 Table / Procedure / Function / Package Header
•    2 Package Body
•    3 Trigger
•    4 Index
•    5 Cluster
•    6 Object
•    7 Pipe
•    13 Java Source
•    14 Java Resource
•    32 Java Data
So what object it is ? P1 had a address and X$KGLOB has a list of all objects (I checked that table next day but there was no restart and it was a chance that there is still that same object in shared pool as session was running 24/7):
SQL> select kglnaown "Owner", kglnaobj "Object" from x$kglob where kglhdadr='07000002AC789A98'; 

Owner  Object
------ --------
SOFT   SNDQ
What type of objects is it ?
SQL> select object_type, object_name from dba_objects where owner='SOFT' and object_name = 'SNDQ';

OBJECT_TYPE OBJECT_NAME
----------- -----------
QUEUE       SNDQ
Bingo - object was a queue there was a dequeue in that PL/SQL code. My process called importantp.exe was waiting for lock on queue. Let's check lock type:
  • 2    - Share mode 
  • 3    - Exclusive mode 
For importantp.exe lock mode in p3 parameter was 210 which mean wait for shared lock. But queuing and dequeueing operations are running with shared locks so reason for waiting had to be related to none of them. I found it in last line of my listing there was another processes waiting for exclusive lock on that same queue. Using SQL_ID (261624711)  I discovered that there was another PL/SQL piece of code and add_subscriber function was called inside. 
That was a explanation of importantp.exe hungs - as dequeue function was called after add_subscriber function it had to wait until DDL command was finished. Add subscriber function had to wait until all previous operations in shared mode will be completed and there was typical chain of locks.

There are bugs in Metalink raised by other people describing similar situations and there is only one workaround do not try any DDL on queues during traffic peaks. This recommendation is well know and most DBAs preventing database from typical DDL operation ( like index rebuild or schema changes) during business hours. The problems starts with hidden  DDL commands used in application.


Sunday, July 18, 2010

OEM performance tab and Active Session History data.

Graphic representation of Active Session History data in Database Console or in Grid Control is causing some misunderstanding of how to read it. I have lot of questions about Y scale on OEM performance tab and now I decide to blog about it. I want to describe now OEM is using a ASH date and what we can do with this information. Anyway using SQL instead of OEM is more useful in performance tuning and can be easily focused on existing issue.

For other point of view knowledge of ASH give you opportunity to use free implementation of it developed by Kyle Hailey. (Thanks Kyle for your excellent scripts – this is great staff to learn from). By the way recently I have deployed SASH on big production database running on Oracle 9i and it help me to identify and address performance problems. As soon as I finish all these work I will blog about it too.

CPU utilization
CPU usage on performance tab is in green and in my case it has upper limit set to 2 – as I have two core CPU.


It is quite clear how to read CPU usage and what is a maximum CPU utilization. Scope of time which sessions can spend on CPU has an upper limit which is 100 % utilization of all CPU. It mean that maximum CPU time for session in sampling time can be calculated using following formula

Number of CPU x Sampling Time

So if we want to calculate a CPU usage we need to use following formula:

time consumed per all sessions / (Number of CPU x Sampling Time) * 100 %

ex.
time consumed per all sessions = 16 s
Number of CPU = 2
Sampling time = 15 s 

Util of all CPUs = 15 / (2 * 15) * 100 % = 53.3 %

How to calculate CPU utilisation using Active Session History data instead of OEM performance tab ?

Active Session History table is a results of sampling V$SESSION view every second. That mean there is no “real” waiting or executing time in V$ACTIVE_SESSION_HISTORY but just samples. We are assuming that every state in ASH view (WAITING or ON CPU) took exactly 1 s. This is important as we don’t sum waiting time from ASH view but we have to count how many 1 s events were in observed time.
Be aware that DBA_HIST_ACTIVE_SESS_HISTORY has only 1/10 data of  V$ACTIVE_SESSIN_HISTORY (only sample_id modulo 10 are copied into it) and each sample is now 10s long if we want to use it for our calculations.

Following query will calculate average CPU utilization for two minutes:
SQL> select trunc(count(*)/
  2  (min(p.value)*(max(sample_id) - min (sample_id)))*100,2) "CPU_UTIL",
  3  max(sample_id) - min (sample_id) "sampling time [s]",
  4  count(*) "time consumed per sessions [s]",
  5  min(p.value) * (max(sample_id) - min (sample_id)) "maximum possible time [s]"
  6  from v$active_session_history,
  7  (select value from v$parameter  where name = 'cpu_count') p
  8  where 1=1
  9  and sample_time >= to_date('18-07-2010 12:23:00','dd-mm-yyyy hh24:mi:ss')
 10  and sample_time <= to_date('18-07-2010 12:25:00','dd-mm-yyyy hh24:mi:ss')
 11  and session_state = 'ON CPU';

  CPU_UTIL sampling time [s] time consumed per sessions [s] maximum possible time [s]
---------- ----------------- ------------------------------ -------------------------
     29.23               118                             69                       236


But this doesn’t help us in understanding OEM graph. Refresh rate is set to 15 sec so we need to display data from that period split into 15 sec slices.
SQL> select trunc(to_char(sample_time,'SSSSS')/15) "15s_sample", trunc(count(*)/
  2  (min(p.value)*15)*100,2) "CPU_UTIL",
  3  count(*) "time consumed per sessions [s]"
  4  from v$active_session_history,
  5  (select value from v$parameter  where name = 'cpu_count') p
  6  where 1=1
  7  and sample_time >= to_date('18-07-2010 12:23:00','dd-mm-yyyy hh24:mi:ss')
  8  and sample_time <= to_date('18-07-2010 12:25:00','dd-mm-yyyy hh24:mi:ss')
  9  and session_state = 'ON CPU'
 10  group by trunc(to_char(sample_time,'SSSSS')/15)
 11  order by 1;

15s_sample   CPU_UTIL time consumed per sessions [s]
---------- ---------- ------------------------------
      2972       6.66                              2
      2973       6.66                              2
      2974      53.33                             16
      2975      23.33                              7
      2976      36.66                             11
      2977      26.66                              8
      2978      33.33                             10
      2979      43.33                             13

8 rows selected.

Now if you use any graphic tool to create a chart using 15s_sample column as X axis and "CPU util" column as Y axis you should get very similar picture to OEM chart.

Wait events

There is no easily calculated upper limit for waiting sessions. Why ? Number of waits per sample is depended on session state and number of waiting sessions. So what is a maximum for that ? Should we use only active sessions or all connected sessions ? The worst case scenario is all sessions are active and all are in waiting state. Both values multiplied each other give us a potential upper limit. 

Performance tab in OEM is calculating a average of waits in sample time divided by groups and all groups are displayed using stacked area graph.

Following SQL return data used to create following graphs
SQL> select trunc(to_char(sample_time,'SSSSS')/15) "15s_sample", wait_class, count(*)/15
  2  from v$active_session_history
  3  where sample_time >= to_date('18-07-2010 12:23:00','dd-mm-yyyy hh24:mi:ss')
  4  and sample_time <= to_date('18-07-2010 12:25:00','dd-mm-yyyy hh24:mi:ss')
  5  and session_state = 'WAITING'
  6  group by trunc(to_char(sample_time,'SSSSS')/15), wait_class order by 1,3
  7  ;

15s_sample WAIT_CLASS                                                       COUNT(*)/15
---------- ---------------------------------------------------------------- -----------
      2974 Other                                                                     .2
      2974 System I/O                                                        .533333333
      2974 Commit                                                                     2
      2974 User I/O                                                          2.06666667
      2974 Concurrency                                                       4.66666667
      2975 Application                                                       .066666667
      2975 System I/O                                                        .933333333
      2975 User I/O                                                                 3.8
      2975 Commit                                                            4.93333333
      2976 Concurrency                                                       .066666667
      2976 Application                                                       .066666667
      2976 Other                                                                     .4
      2976 System I/O                                                               1.2
      2976 Commit                                                                   5.4
      2976 User I/O                                                                 5.8
      2977 System I/O                                                        .866666667
      2977 User I/O                                                          2.93333333
      2977 Commit                                                                   5.6
      2978 Concurrency                                                       .066666667
      2978 System I/O                                                        .933333333
      2978 User I/O                                                          3.26666667
      2978 Commit                                                                     4
      2979 System I/O                                                        .866666667
      2979 User I/O                                                          1.46666667
      2979 Commit                                                            2.53333333

25 rows selected.
As results we have average time spend by all sessions divided per wait class for a 1 second.  What it mean that in sample 2976 wait class “Commit” average time was 5.4 ? It exactly mean that in that 15 seconds sampling time all active sessions were 5.4*15 = 81 times waiting for events in commit class. Is it bad or not ? This is a different question and will not be answered in that post.

Now if we take a look on sample nr 2976
2976 Application                                                       .066666667
2976 Other                                                                     .4
2976 System I/O                                                               1.2
2976 Commit                                                                   5.4
2976 User I/O                                                                 5.8

Sum of all waits give us 12.8 and this is more less what we can see as a top value on graph. A differences between graph and our calculation are related to observation period start time and how we split results into 15 sampling time. We are using a group by function based on sample_time truncated into 4 periods of seconds - using group by to_char function (based on Kyle idea).These periods are: 0 – 15, 15- 30 , 30-45, 45-60 but OEM can display data based on different 15 sec groups ex. 10 - 25, 25 - 40 so small differences can occure.

Now the question is – is this a bad thing to have high numbers of OEM graph ? And like always when we are talking about Oracle performance an answer is depends. 
What is bad for sure if our CPU usage is going over 85-90 % and stay there for a long time. In that case CPU response time will have good opportunity to increase and whole system performance can go down. What to do with waits ? In my opinion a best way is to create a baseline (even on paper) with average waiting time and react only when system is working over our baseline or what is more important when system users are claiming a performance problems.

Next time I will use drill-into functionality of OEM and will try to explain what other graphs means – unfortunately most of them are changing a scale between main performance graph and detailed ones so if you are not aware of that it can cause a lot of misunderstandings.


Regards,
Marcin

Wednesday, July 14, 2010

Oracle and AIX

Hi,

Very useful document about AIX configuration for Oracle Database provided IBM.

regards,
Marcin