2007-05-21

dtrace & java: part 3

My latest struggles with the Java application led me to the following observations (starting again from the very beginning):

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
27607 inkaso 502M 436M sleep 59 0 10:59:30 5.4% java/91
517 root 3916K 3080K sleep 59 0 0:02:09 0.1% nscd/26
18198 root 2848K 2732K cpu1 59 0 0:00:00 0.0% prstat/1
18202 root 964K 676K sleep 59 0 0:00:00 0.0% sleep/1
[...]

prstat -mL -p 27607

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
27607 inkaso 3.7 0.0 0.0 0.0 0.0 4.4 92 0.1 163 4 101 2 java/5142
27607 inkaso 0.8 0.0 0.0 0.0 0.0 99 0.0 0.0 7 2 15 0 java/4
27607 inkaso 0.1 0.0 0.0 0.0 0.0 0.2 100 0.0 6 0 127 0 java/5676
27607 inkaso 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 8 1 72 0 java/5679
27607 inkaso 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 4 0 52 0 java/5662
[...]

Thread id = 5142 is:

bash-3.00# grep 5142 nohup.out
S, 5142, 192.168.210.13-Connection-5122, 2007-05-21--07:24:16
S, 5162, 192.168.209.216-Connection-5142, 2007-05-21--07:33:21

bash-3.00# netstat -a|grep 192.168.210.13
inkaso220.6400 192.168.210.135.50835 5840 0 49640 0 ESTABLISHED
inkaso220.6400 192.168.210.13.51108 2144 0 49312 0 CLOSE_WAIT

This means, again, that some thread (here 5142) is still doing work when it should already be dead. The state of the network connection is CLOSE_WAIT, and according to a SunSolve document, "indefinite CLOSE_WAIT state normally indicates some application level bug.".
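CLOSE_WAIT means the remote side has already sent its FIN, but the local application has never called close() on its end of the socket. From the Java side such a peer disconnect is visible as read() returning -1. A minimal, self-contained sketch (my own stand-in names, not the application's actual code):

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class CloseWaitDemo {
    // Connects to a local stand-in "device", lets the peer close first,
    // and returns what read() sees afterwards.
    static int readAfterPeerClose() throws Exception {
        ServerSocket device = new ServerSocket(0);      // stand-in for the mobile device
        Socket app = new Socket("127.0.0.1", device.getLocalPort());
        device.accept().close();                        // peer sends FIN and goes away

        // From the application's side the disconnect shows up as end-of-stream:
        // read() returns -1. If the code ignores this and never calls close(),
        // the local socket sits in CLOSE_WAIT indefinitely.
        InputStream in = app.getInputStream();
        int r = in.read();
        app.close();                                    // this is what ends CLOSE_WAIT
        device.close();
        return r;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("read() after peer close: " + readAfterPeerClose()); // prints -1
    }
}
```

Until the application calls close() on its socket, netstat keeps reporting the connection in CLOSE_WAIT, which matches what we see above.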
And more:

bash-3.00# while true
> do
> date
> netstat -a|grep 192.168.210.13
> sleep 30
> done
Mon May 21 09:12:18 CEST 2007
inkaso220.6400 192.168.210.139.50235 5840 0 49640 0 ESTABLISHED
inkaso220.6400 192.168.210.135.50835 5840 0 49640 0 ESTABLISHED
inkaso220.6400 192.168.210.13.51108 2144 0 49312 0 CLOSE_WAIT
inkaso220.6400 192.168.210.131.59018 5840 0 49640 0 ESTABLISHED
Mon May 21 09:12:49 CEST 2007
inkaso220.6400 192.168.210.139.50235 5840 0 49640 0 ESTABLISHED
inkaso220.6400 192.168.210.135.50835 5840 0 49640 0 ESTABLISHED
inkaso220.6400 192.168.210.13.51108 2144 0 49312 0 CLOSE_WAIT
inkaso220.6400 192.168.210.131.59018 5840 0 49640 0 ESTABLISHED
Mon May 21 09:13:19 CEST 2007
inkaso220.6400 192.168.210.139.50235 5840 0 49640 0 ESTABLISHED
inkaso220.6400 192.168.210.135.50835 5840 0 49640 0 ESTABLISHED
inkaso220.6400 192.168.210.13.51108 2144 0 49312 0 CLOSE_WAIT
inkaso220.6400 192.168.210.131.59018 5840 0 49640 0 ESTABLISHED
...
Mon May 21 10:21:07 CEST 2007
inkaso220.6400 192.168.210.135.50835 5840 0 49640 0 ESTABLISHED
inkaso220.6400 192.168.210.13.51108 2144 0 49312 0 CLOSE_WAIT
Mon May 21 10:21:37 CEST 2007
inkaso220.6400 192.168.210.135.50835 5840 0 49640 0 ESTABLISHED
inkaso220.6400 192.168.210.13.51108 2144 0 49312 0 CLOSE_WAIT
...

The changes our Java programmer made last time have not helped at all. Meanwhile, a new software developer has taken over the application (the previous one was assigned to another project), so now I am going to pass all my observations on to him.

2007-05-15

How to change a zone configuration?

I always forget how to change the configuration of an already configured zone. I hope from now on I won't.

How to remove an fs resource already defined/assigned to a zone:

bash-3.00# zonecfg -z urticaS
zonecfg:urticaS> info
zonepath: /zones/urticaS
autoboot: true
pool:
inherit-pkg-dir:
dir: /lib
inherit-pkg-dir:
dir: /platform
inherit-pkg-dir:
dir: /sbin
inherit-pkg-dir:
dir: /usr
fs:
dir: /d8
special: /vxfs/sowa-urticaS-vol
raw not specified
type: lofs
options: []
net:
address: 192.168.220.113
physical: ce3
zonecfg:urticaS> remove fs dir=/d8
zonecfg:urticaS> info
zonepath: /zones/urticaS
autoboot: true
pool:
inherit-pkg-dir:
dir: /lib
inherit-pkg-dir:
dir: /platform
inherit-pkg-dir:
dir: /sbin
inherit-pkg-dir:
dir: /usr
net:
address: 192.168.220.113
physical: ce3
zonecfg:urticaS> verify
zonecfg:urticaS> commit
zonecfg:urticaS> exit

How to change an fs resource already defined/assigned to a zone:

bash-3.00# zonecfg -z obs1
zonecfg:obs1> select fs dir=/d8
zonecfg:obs1:fs> info
fs:
dir: /d8
special: /vxfs/drozd-obs1-vol
raw not specified
type: lofs
options: []
zonecfg:obs1:fs> set special=/vxfs/wilga3-obs1-vol
zonecfg:obs1:fs> info
fs:
dir: /d8
special: /vxfs/wilga3-obs1-vol
raw not specified
type: lofs
options: []
zonecfg:obs1:fs> end
zonecfg:obs1> verify
zonecfg:obs1> commit
zonecfg:obs1> exit


Just for myself ... :-)

dtrace & java: part 2

Do you remember my recent problem with the Java application that behaves strangely? Changing the memory allocator from the standard malloc to mtmalloc has not changed much: there is still high CPU pressure.
I thought for a while about how to change my approach to catch what is wrong with the application. Since prstat can show which thread is using the CPU, I started with it again:

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
28123 przemol 596M 522M sleep 59 0 15:42:15 17% java/172
15973 root 7600K 3876K sleep 59 0 0:00:00 0.1% sshd/1
15986 przemol 2840K 2612K cpu1 59 0 0:00:00 0.0% prstat/1
126 daemon 3916K 2044K sleep 59 0 0:00:00 0.0% kcfd/3
15978 przemol 1160K 932K sleep 49 0 0:00:00 0.0% ksh/1
15976 przemol 7372K 1928K sleep 59 0 0:00:00 0.0% sshd/1


And again, which Java threads are using the CPU the most:

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
28123 przemol 12 0.0 0.0 0.0 0.0 4.8 83 0.1 152 18 83 0 java/5069
28123 przemol 2.2 0.0 0.0 0.0 0.0 98 0.0 0.0 7 4 26 0 java/4
28123 przemol 0.2 0.0 0.0 0.0 0.0 0.3 99 0.0 35 0 162 0 java/5364
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.1 6 0 70 0 java/5359
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 6 0 56 0 java/5351
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 5 0 56 0 java/5343
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 5 0 57 0 java/5355
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 4 0 70 0 java/5354
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 7 0 53 0 java/5331
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.1 100 0.0 10 0 60 0 java/5347
28123 przemol 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 3 0 40 0 java/5325


As we can see, Java thread 5069 is generating the CPU pressure. But what is the thread doing? As I mentioned before, the application is a sort of interface between mobile devices and a database: whenever an end user commits an operation on a mobile device, it is sent to the Java application, which commits it into the database. Every device is served by one dedicated Java thread. Our Java programmer designed the application so that each thread is created with a meaningful name, e.g. 192.168.1.1-Connection-Number. So I had the thread name (the application writes all its activity into separate logs, keyed by thread name) and the thread id (tid, from prstat); the only problem was how to connect the two. Because the Java DTrace probes do not allow me (or do they?) to trace thread names during arbitrary operations (only in the thread-start and thread-end probes), I had to watch all thread-start and thread-end probes for the whole run. To do so I wrote a simple DTrace script (to be more precise: two scripts) which let me watch the whole life (start, end and name) of every thread. The first one, istart.d:

#!/usr/sbin/dtrace -ws

proc:::exec-success
/zonename == "zone1" && execname == "java"/
/* there is only one java process in zone1 */
{
        printf("Start: pid=%d\n", pid);
        system("/root/d/itrace2.d %d", pid);
}

And itrace2.d:

#!/usr/sbin/dtrace -wqs

dvm$1:::thread-start
{
        printf("S, %10d, %35s, ", tid, copyinstr(arg0));
        system("/root/print_date");
}

dvm$1:::thread-end
{
        printf("E, %10d, ", tid);
        system("/root/print_date");
}

And simple print_date one-line script:

/usr/bin/date +%Y-%m-%d--%H:%M:%S

I ran it with 'nohup istart.d &'. The result is:

bash-3.00# head -30 nohup.out
dtrace: script './istart.d.old' matched 1 probe
dtrace: allowing destructive actions
CPU ID FUNCTION:NAME
1 554 exec_common:exec-success Start: pid=28123
S, 13, Timer-0, 2007-05-10--06:43:21
S, 14, RMI TCP Accept-0, 2007-05-10--06:43:21
S, 15, RMI TCP Accept-8000, 2007-05-10--06:43:21
E, 1, 2007-05-10--06:43:21
S, 1, DestroyJavaVM, 2007-05-10--06:43:21
S, 17, Connection Controller, 2007-05-10--06:43:21
S, 19, Server, 2007-05-10--06:43:21
S, 18, Statystyka watkow, 2007-05-10--06:43:21
S, 1, main, 2007-05-10--06:43:21
S, 20, 192.168.210.109-Connection-0, 2007-05-10--06:43:42
S, 21, 192.168.209.74-Connection-1, 2007-05-10--06:43:47
S, 22, 192.168.210.116-Connection-2, 2007-05-10--06:43:53
S, 23, 192.168.209.125-Connection-3, 2007-05-10--06:44:06
S, 24, 192.168.210.11-Connection-4, 2007-05-10--06:44:30
S, 25, 192.168.209.12-Connection-5, 2007-05-10--06:45:08
S, 26, 192.168.210.72-Connection-6, 2007-05-10--06:45:36
S, 27, 192.168.209.159-Connection-7, 2007-05-10--06:45:44
S, 28, 192.168.210.93-Connection-8, 2007-05-10--06:45:55
S, 29, 192.168.209.238-Connection-9, 2007-05-10--06:46:03
E, 23, 2007-05-10--06:46:50
E, 20, 2007-05-10--06:47:02
S, 30, 192.168.209.116-Connection-10, 2007-05-10--06:47:10
S, 31, 192.168.210.5-Connection-11, 2007-05-10--06:47:34
E, 22, 2007-05-10--06:47:49
S, 32, 192.168.209.210-Connection-12, 2007-05-10--06:49:05
S, 33, 192.168.209.55-Connection-13, 2007-05-10--06:50:11
...

The first column says whether the thread is starting (S) or ending its life (E). The second column gives the thread id and the third the thread name. Thus I can see exactly when each thread started and ended. For example:

bash-3.00# grep " 20," nohup.out
S, 20, 192.168.210.109-Connection-0, 2007-05-10--06:43:42
E, 20, 2007-05-10--06:47:02

which means that thread no. 20 was alive for about 3-4 minutes. In my case thread id 5069 is the most interesting:

bash-3.00# grep " 5069" nohup.out
S, 5069, 192.168.210.185-Connection-5049, 2007-05-14--21:45:20

It started yesterday and is still alive. But the most valuable piece of information is the thread name! Now I can look at the logs related to this thread:

bash-3.00# tail log_192.168.210.185_May14
May 14, 2007 6:57:02 PM 192.168.210.185-Connection-4695 Zamykam polaczenia do bazy TC
May 14, 2007 6:57:02 PM 192.168.210.185-Connection-4695 Zamykam polaczenia do bazy WONLOK
May 14, 2007 6:57:15 PM May 14, 2007 6:57:15 PM Polaczenie z IP:192.168.210.185
May 14, 2007 6:57:15 PM 192.168.210.185-Connection-4999 IP = 192.168.210.185
May 14, 2007 6:57:15 PM May 14, 2007 6:57:15 PM 192.168.210.185 Metoda waitForCommand(Hashtable pCommands,,) ROZPOCZYNA DZIAŁANIE
May 14, 2007 6:57:15 PM pCommands.size() 21
May 14, 2007 9:45:19 PM May 14, 2007 9:45:19 PM Polaczenie z IP:192.168.210.185
May 14, 2007 9:45:19 PM 192.168.210.185-Connection-5049 IP = 192.168.210.185
May 14, 2007 9:45:19 PM May 14, 2007 9:45:19 PM 192.168.210.185 Metoda waitForCommand(Hashtable pCommands,,) ROZPOCZYNA DZIAŁANIE
May 14, 2007 9:45:19 PM pCommands.size() 21

According to the author of the application, the above logs indicate normal activity (the Polish messages report closing the database connections and the waitForCommand method starting). So the thread is just waiting for activity on the mobile device side. But:

bash-3.00# netstat -a|grep 192.168.210.185
bash-3.00#

gives nothing! There is no ESTABLISHED connection (the application exchanges data with the devices over TCP/IP sockets). Whenever I check any other thread, e.g. 5778, I get:

bash-3.00# grep " 5778" nohup.out
S, 5778, 192.168.209.162-Connection-5758, 2007-05-15--10:24:07

and

bash-3.00# netstat -a |grep 192.168.209.162
inkaso220.6400 192.168.209.162.50639 2144 28 49312 0 ESTABLISHED

So each "normal" thread has a TCP/IP socket in the ESTABLISHED state, but thread 5069 has nothing! It seems the mobile device has already disconnected, but the application does not register that. I showed this to the author, and she mentioned that she does not check timeouts on the TCP/IP sockets because that was a design requirement (the mobile devices connect over GPRS, and it is cheaper (!) to hold an already established connection for many hours than to initiate a new one). She admitted that there is a line with setSoTimeout, but it had been commented out. She agreed to uncomment it and see tomorrow, after a restart, whether it changes anything.
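For what it is worth, the effect of the commented-out setSoTimeout line can be sketched as follows (a minimal example with my own names, not the application's code): with a read timeout set on the socket, a blocking read on a connection that has gone quiet throws SocketTimeoutException instead of blocking forever, so the serving thread gets a chance to notice the dead device and clean up.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutDemo {
    // Returns true if the read timed out (no data arrived within the limit).
    static boolean readTimesOut(int timeoutMillis) throws Exception {
        ServerSocket device = new ServerSocket(0);   // stand-in peer that never sends
        Socket app = new Socket("127.0.0.1", device.getLocalPort());
        Socket peer = device.accept();               // connected, but silent
        app.setSoTimeout(timeoutMillis);             // the commented-out line
        InputStream in = app.getInputStream();
        try {
            in.read();                               // would block forever without the timeout
            return false;
        } catch (SocketTimeoutException e) {
            return true;                             // thread can now clean up and exit
        } finally {
            peer.close();
            app.close();
            device.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

A timeout on read() only detects silence, of course; the thread still has to decide how long a quiet GPRS device is allowed to stay connected before it closes the socket and exits.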