przemol: dtrace & java: part 2

Do you remember my last problem with java application which behaves strangely ? Changing memory allocator from standard malloc into mtmalloc has not changed a lot. There is still high cpu pressure.
I thought for a while how to change approach to catch what is wrong with the application. Because prstat can show which thread is using cpu I started again with it:


   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
 28123 przemol   596M  522M sleep   59    0  15:42:15  17% java/172
 15973 root     7600K 3876K sleep   59    0   0:00:00 0.1% sshd/1
 15986 przemol  2840K 2612K cpu1    59    0   0:00:00 0.0% prstat/1
   126 daemon   3916K 2044K sleep   59    0   0:00:00 0.0% kcfd/3
 15978 przemol  1160K  932K sleep   49    0   0:00:00 0.0% ksh/1
 15976 przemol  7372K 1928K sleep   59    0   0:00:00 0.0% sshd/1

And again which java threads are using cpu the most:


   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
 28123 przemol   12 0.0 0.0 0.0 0.0 4.8  83 0.1 152  18  83   0 java/5069
 28123 przemol  2.2 0.0 0.0 0.0 0.0  98 0.0 0.0   7   4  26   0 java/4
 28123 przemol  0.2 0.0 0.0 0.0 0.0 0.3  99 0.0  35   0 162   0 java/5364
 28123 przemol  0.1 0.0 0.0 0.0 0.0 0.1 100 0.1   6   0  70   0 java/5359
 28123 przemol  0.1 0.0 0.0 0.0 0.0 0.1 100 0.0   6   0  56   0 java/5351
 28123 przemol  0.1 0.0 0.0 0.0 0.0 0.1 100 0.0   5   0  56   0 java/5343
 28123 przemol  0.1 0.0 0.0 0.0 0.0 0.1 100 0.0   5   0  57   0 java/5355
 28123 przemol  0.1 0.0 0.0 0.0 0.0 0.1 100 0.0   4   0  70   0 java/5354
 28123 przemol  0.1 0.0 0.0 0.0 0.0 0.1 100 0.0   7   0  53   0 java/5331
 28123 przemol  0.1 0.0 0.0 0.0 0.0 0.1 100 0.0  10   0  60   0 java/5347
 28123 przemol  0.1 0.0 0.0 0.0 0.0 0.0 100 0.0   3   0  40   0 java/5325

As we can see, java thread 5069 is generating the cpu pressure. But what is doing the thread ? As I mentioned before the application is a sort of interface between mobile devices and database: whenever end user commits an operation on mobile device, it is sent to the java application and the application commits it into database. Every device is served by one, dedicated java thread. Our java programmer designed the application in a way that while creating the threads each thread has a meaningful name, e.g. 192.168.1.1-Connection-Number. So if I have thread name (the application writes all the activity into separated, connected to thread name, logs) and thread id (tid - from prstat) the only problem was how to connect them together ? Because java dtrace probes doesn't allow me (does it ?) to trace thread names during all the operations (except thread-start and thread-end probe) I had to watch all thread-start and thread-end probes during all the activity. To do so I wrote a simple DTrace script (to be more precise: two scripts) which allowed me to watch whole life (start, end and name) of all threads. The first one, istart.d :


#!/usr/sbin/dtrace -ws

proc:::exec-success
/zonename == "zone1" && execname == "java"/
/* there is only one java process in zone1 */
{
 printf ("Start: pid=%d\n", pid);
 system ("/root/d/itrace2.d %d", pid);
}

And itrace2.d:


#!/usr/sbin/dtrace -wqs

dvm$1:::thread-start
{
 printf ("S, %10d, %35s, ",tid, copyinstr(arg0));
 system("/root/print_date");
}

dvm$1:::thread-end
{
 printf ("E, %10d,                                          ", tid);
 system ("/root/print_date");
}

And simple print_date one-line script:


/usr/bin/date +%Y-%m-%d--%H:%M:%S

I have run it by 'nohup istart.d &'. And the result is:


bash-3.00# head -30 nohup.out
dtrace: script './istart.d.old' matched 1 probe
dtrace: allowing destructive actions
CPU     ID                    FUNCTION:NAME
  1    554         exec_common:exec-success Start: pid=28123
S,         13,                             Timer-0,     2007-05-10--06:43:21
S,         14,                    RMI TCP Accept-0,     2007-05-10--06:43:21
S,         15,                 RMI TCP Accept-8000,     2007-05-10--06:43:21
E,          1,                                          2007-05-10--06:43:21
S,          1,                       DestroyJavaVM,     2007-05-10--06:43:21
S,         17,               Connection Controller,     2007-05-10--06:43:21
S,         19,                              Server,     2007-05-10--06:43:21
S,         18,                   Statystyka watkow,     2007-05-10--06:43:21
S,          1,                                main,     2007-05-10--06:43:21
S,         20,        192.168.210.109-Connection-0,     2007-05-10--06:43:42
S,         21,         192.168.209.74-Connection-1,     2007-05-10--06:43:47
S,         22,        192.168.210.116-Connection-2,     2007-05-10--06:43:53
S,         23,        192.168.209.125-Connection-3,     2007-05-10--06:44:06
S,         24,         192.168.210.11-Connection-4,     2007-05-10--06:44:30
S,         25,         192.168.209.12-Connection-5,     2007-05-10--06:45:08
S,         26,         192.168.210.72-Connection-6,     2007-05-10--06:45:36
S,         27,        192.168.209.159-Connection-7,     2007-05-10--06:45:44
S,         28,         192.168.210.93-Connection-8,     2007-05-10--06:45:55
S,         29,        192.168.209.238-Connection-9,     2007-05-10--06:46:03
E,         23,                                          2007-05-10--06:46:50
E,         20,                                          2007-05-10--06:47:02
S,         30,       192.168.209.116-Connection-10,     2007-05-10--06:47:10
S,         31,         192.168.210.5-Connection-11,     2007-05-10--06:47:34
E,         22,                                          2007-05-10--06:47:49
S,         32,       192.168.209.210-Connection-12,     2007-05-10--06:49:05
S,         33,        192.168.209.55-Connection-13,     2007-05-10--06:50:11
...

First column says whether the thread is starting (S) or finishing its life (E). Second column tells thread id and third column thread name. Thus I can watch when exactly started and ended each thread. For example:


bash-3.00# grep " 20," nohup.out
S,         20,        192.168.210.109-Connection-0,     2007-05-10--06:43:42
E,         20,                                          2007-05-10--06:47:02

what means that thread no 20 was alive for about 3-4 minutes. In my case thread id = 5069 is the most interesting:


bash-3.00# grep " 5069" nohup.out
S,       5069,     192.168.210.185-Connection-5049,     2007-05-14--21:45:20

It started yesterday and is still alive. But the most valuable information is thread name ! Now I can look at logs related to this thread:


bash-3.00# tail log_192.168.210.185_May14
May 14, 2007 6:57:02 PM  192.168.210.185-Connection-4695 Zamykam polaczenia do bazy TC
May 14, 2007 6:57:02 PM  192.168.210.185-Connection-4695 Zamykam polaczenia do bazy WONLOK
May 14, 2007 6:57:15 PM  May 14, 2007 6:57:15 PM Polaczenie z IP:192.168.210.185
May 14, 2007 6:57:15 PM  192.168.210.185-Connection-4999  IP = 192.168.210.185
May 14, 2007 6:57:15 PM  May 14, 2007 6:57:15 PM 192.168.210.185 Metoda waitForCommand(Hashtable pCommands,,) ROZPOCZYNA DZIA?ANIE
May 14, 2007 6:57:15 PM  pCommands.size() 21
May 14, 2007 9:45:19 PM  May 14, 2007 9:45:19 PM Polaczenie z IP:192.168.210.185
May 14, 2007 9:45:19 PM  192.168.210.185-Connection-5049  IP = 192.168.210.185
May 14, 2007 9:45:19 PM  May 14, 2007 9:45:19 PM 192.168.210.185 Metoda waitForCommand(Hashtable pCommands,,) ROZPOCZYNA DZIA?ANIE
May 14, 2007 9:45:19 PM  pCommands.size() 21

According to author of the application the above logs indicates normal activity. So the thread is just waiting for activity on mobile device side. But:


bash-3.00# netstat -a|grep 192.168.210.185
bash-3.00#

gives nothing ! There is no ESTABLISHED state (the application exchanges data with the devices using TCP/IP protocol and sockets). Whenever I check any other thread, e.g. 5778 I get:


bash-3.00# grep " 5778" nohup.out
S,       5778,     192.168.209.162-Connection-5758,     2007-05-15--10:24:07

and


bash-3.00# netstat -a |grep 192.168.209.162
inkaso220.6400       192.168.209.162.50639  2144     28 49312      0 ESTABLISHED

So each "normal" thread has TCP/IP socket in an ESTABLISHED state. But the thread 5069 has nothing ! It seems that the mobile device has already disconnected but the application doesn't register that. I have shown that to the author and she mentioned that she doesn't check timeouts in TCP/IP sockets because there was such requirement during design (the mobile devices are connected using GPRS and it is cheaper (!) to hold already established connection for many hours then initiate a new one). She admitted that there is a line with setSoTimeout but it had been commented out. She agreed to uncomment it and see if it changes anything tomorrow after restart.

przemol

2007-05-15

dtrace & java: part 2

No comments:

About Me

Blog Archive

Links

Popular Posts