2007-05-15

dtrace & java: part 2

Do you remember my last problem with java application which behaves strangely ? Changing memory allocator from standard malloc into mtmalloc has not changed a lot. There is still high cpu pressure.
I thought for a while how to change approach to catch what is wrong with the application. Because prstat can show which thread is using cpu I started again with it:

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
28123 przemol 596M 522M sleep 59 0 15:42:15 17% java/172
15973 root 7600K 3876K sleep 59 0 0:00:00 0.1% sshd/1
15986 przemol 2840K 2612K cpu1 59 0 0:00:00 0.0% prstat/1
126 daemon 3916K 2044K sleep 59 0 0:00:00 0.0% kcfd/3
15978 przemol 1160K 932K sleep 49 0 0:00:00 0.0% ksh/1
15976 przemol 7372K 1928K sleep 59 0 0:00:00 0.0% sshd/1


And again which java threads are using cpu the most:

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
28123 przemol 12 0.0 0.0 0.0 0.0 4.8 83 0.1 152 18 83 0 java/5069
28123 przemol 2.2 0.0 0.0 0.0 0.0 98 0.0 0.0 7 4 26 0 java/4
28123 przemol 0.2 0.0 0.0 0.0 0.0 0.3 99 0.0 35 0 162 0 java/5364
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.1 6 0 70 0 java/5359
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 6 0 56 0 java/5351
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 5 0 56 0 java/5343
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 5 0 57 0 java/5355
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 4 0 70 0 java/5354
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 7 0 53 0 java/5331
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 10 0 60 0 java/5347
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 3 0 40 0 java/5325


As we can see, java thread 5069 is generating the cpu pressure. But what is doing the thread ? As I mentioned before the application is a sort of interface between mobile devices and database: whenever end user commits an operation on mobile device, it is sent to the java application and the application commits it into database. Every device is served by one, dedicated java thread. Our java programmer designed the application in a way that while creating the threads each thread has a meaningful name, e.g. 192.168.1.1-Connection-Number. So if I have thread name (the application writes all the activity into separated, connected to thread name, logs) and thread id (tid - from prstat) the only problem was how to connect them together ? Because java dtrace probes doesn't allow me (does it ?) to trace thread names during all the operations (except thread-start and thread-end probe) I had to watch all thread-start and thread-end probes during all the activity. To do so I wrote a simple DTrace script (to be more precise: two scripts) which allowed me to watch whole life (start, end and name) of all threads. The first one, istart.d :

#!/usr/sbin/dtrace -ws

proc:::exec-success
/zonename == "zone1" && execname == "java"/
/* there is only one java process in zone1 */
{
printf ("Start: pid=%d\n", pid);
system ("/root/d/itrace2.d %d", pid);
}

And itrace2.d:

#!/usr/sbin/dtrace -wqs

dvm$1:::thread-start
{
printf ("S, %10d, %35s, ",tid, copyinstr(arg0));
system("/root/print_date");
}

dvm$1:::thread-end
{
printf ("E, %10d, ", tid);
system ("/root/print_date");
}

And simple print_date one-line script:

/usr/bin/date +%Y-%m-%d--%H:%M:%S

I have run it by 'nohup istart.d &'. And the result is:

bash-3.00# head -30 nohup.out
dtrace: script './istart.d.old' matched 1 probe
dtrace: allowing destructive actions
CPU ID FUNCTION:NAME
1 554 exec_common:exec-success Start: pid=28123
S, 13, Timer-0, 2007-05-10--06:43:21
S, 14, RMI TCP Accept-0, 2007-05-10--06:43:21
S, 15, RMI TCP Accept-8000, 2007-05-10--06:43:21
E, 1, 2007-05-10--06:43:21
S, 1, DestroyJavaVM, 2007-05-10--06:43:21
S, 17, Connection Controller, 2007-05-10--06:43:21
S, 19, Server, 2007-05-10--06:43:21
S, 18, Statystyka watkow, 2007-05-10--06:43:21
S, 1, main, 2007-05-10--06:43:21
S, 20, 192.168.210.109-Connection-0, 2007-05-10--06:43:42
S, 21, 192.168.209.74-Connection-1, 2007-05-10--06:43:47
S, 22, 192.168.210.116-Connection-2, 2007-05-10--06:43:53
S, 23, 192.168.209.125-Connection-3, 2007-05-10--06:44:06
S, 24, 192.168.210.11-Connection-4, 2007-05-10--06:44:30
S, 25, 192.168.209.12-Connection-5, 2007-05-10--06:45:08
S, 26, 192.168.210.72-Connection-6, 2007-05-10--06:45:36
S, 27, 192.168.209.159-Connection-7, 2007-05-10--06:45:44
S, 28, 192.168.210.93-Connection-8, 2007-05-10--06:45:55
S, 29, 192.168.209.238-Connection-9, 2007-05-10--06:46:03
E, 23, 2007-05-10--06:46:50
E, 20, 2007-05-10--06:47:02
S, 30, 192.168.209.116-Connection-10, 2007-05-10--06:47:10
S, 31, 192.168.210.5-Connection-11, 2007-05-10--06:47:34
E, 22, 2007-05-10--06:47:49
S, 32, 192.168.209.210-Connection-12, 2007-05-10--06:49:05
S, 33, 192.168.209.55-Connection-13, 2007-05-10--06:50:11
...

First column says whether the thread is starting (S) or finishing its life (E). Second column tells thread id and third column thread name. Thus I can watch when exactly started and ended each thread. For example:

bash-3.00# grep " 20," nohup.out
S, 20, 192.168.210.109-Connection-0, 2007-05-10--06:43:42
E, 20, 2007-05-10--06:47:02

what means that thread no 20 was alive for about 3-4 minutes. In my case thread id = 5069 is the most interesting:

bash-3.00# grep " 5069" nohup.out
S, 5069, 192.168.210.185-Connection-5049, 2007-05-14--21:45:20

It started yesterday and is still alive. But the most valuable information is thread name ! Now I can look at logs related to this thread:

bash-3.00# tail log_192.168.210.185_May14
May 14, 2007 6:57:02 PM 192.168.210.185-Connection-4695 Zamykam polaczenia do bazy TC
May 14, 2007 6:57:02 PM 192.168.210.185-Connection-4695 Zamykam polaczenia do bazy WONLOK
May 14, 2007 6:57:15 PM May 14, 2007 6:57:15 PM Polaczenie z IP:192.168.210.185
May 14, 2007 6:57:15 PM 192.168.210.185-Connection-4999 IP = 192.168.210.185
May 14, 2007 6:57:15 PM May 14, 2007 6:57:15 PM 192.168.210.185 Metoda waitForCommand(Hashtable pCommands,,) ROZPOCZYNA DZIA?ANIE
May 14, 2007 6:57:15 PM pCommands.size() 21
May 14, 2007 9:45:19 PM May 14, 2007 9:45:19 PM Polaczenie z IP:192.168.210.185
May 14, 2007 9:45:19 PM 192.168.210.185-Connection-5049 IP = 192.168.210.185
May 14, 2007 9:45:19 PM May 14, 2007 9:45:19 PM 192.168.210.185 Metoda waitForCommand(Hashtable pCommands,,) ROZPOCZYNA DZIA?ANIE
May 14, 2007 9:45:19 PM pCommands.size() 21

According to author of the application the above logs indicates normal activity. So the thread is just waiting for activity on mobile device side. But:

bash-3.00# netstat -a|grep 192.168.210.185
bash-3.00#

gives nothing ! There is no ESTABLISHED state (the application exchanges data with the devices using TCP/IP protocol and sockets). Whenever I check any other thread, e.g. 5778 I get:

bash-3.00# grep " 5778" nohup.out
S, 5778, 192.168.209.162-Connection-5758, 2007-05-15--10:24:07

and

bash-3.00# netstat -a |grep 192.168.209.162
inkaso220.6400 192.168.209.162.50639 2144 28 49312 0 ESTABLISHED

So each "normal" thread has TCP/IP socket in an ESTABLISHED state. But the thread 5069 has nothing ! It seems that the mobile device has already disconnected but the application doesn't register that. I have shown that to the author and she mentioned that she doesn't check timeouts in TCP/IP sockets because there was such requirement during design (the mobile devices are connected using GPRS and it is cheaper (!) to hold already established connection for many hours then initiate a new one). She admitted that there is a line with setSoTimeout but it had been commented out. She agreed to uncomment it and see if it changes anything tomorrow after restart.

No comments: