Wednesday, February 10, 2016

Why was Puppet using 96% CPU, whether run as a daemon or from the command line?

Problem: Puppet was consuming 96% CPU whether it ran as a daemon or was invoked as "puppet agent -t".
Environment: RHEL 6
Solution: Make sure the application's Puppet module is tested and free of errors and unwanted resources.
Root Cause Analysis: 

top showed the following:


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11403 root 20 0 259m 140m 4516 R 96.7 1.9 9:08.77 puppet
1. Traced the puppet agent from the command line, with the daemon stopped:
$ strace -Tvv puppet agent -t &> ./strace_log

2. read() system calls were being issued continuously, printing messages like:
[...]
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000031>
[...]
3. Tracking down the filename behind that file descriptor, I found that /home/qpadm/temp had been opened, a file of about 100 GB.
4. read() kept pulling data from this file, 4096 bytes at a time. The trace looked like this:
22:36:37 open("/home/qpadm/temp", O_RDONLY) = 4 <0.000032>
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000026>
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000026>
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000027>
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000026>
22:36:37 fstat(4, {st_dev=makedev(202, 1), st_ino=790228, st_mode=S_IFREG|0600, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=209715224, st_size=107374182400, st_atime=2016/02/03-07:37:15, st_mtime=2016/02/03-05:11:57, st_ctime=2016/02/03-07:36:48}) = 0 <0.000027>
22:36:37 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6d03f3f000 <0.000030>
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000035>
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000031>
In the trace above, /home/qpadm/temp is opened read-only and assigned file descriptor 4, and that same FD is passed to read(), which returns 4096 bytes per call. Reading a 100 GB file this way takes tens of millions of system calls and would effectively never finish. That is why the puppet agent was pinning the CPU.
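The scale of the problem is easy to sketch. Two things are worth noting: how to map an strace FD back to its filename on a live system, and how many read() calls a file of this size actually needs (the /proc and lsof lookups are illustrative and use the puppet PID from the top output above):

```shell
# Map an strace file descriptor back to its filename via /proc
# (11403 is the puppet PID from the top output; run on the live system):
#   ls -l /proc/11403/fd/4     # symlink points at the open file
#   lsof -p 11403              # shows the same mapping

# Rough arithmetic: st_size=107374182400 bytes read in 4096-byte chunks
size=107374182400
chunk=4096
echo "$((size / chunk)) read() calls"   # 26214400, i.e. ~26 million calls
```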
We needed to check why this file existed, what used it, and why it had grown so large. This was the main culprit.
I also noticed that the OVO agent had either been reinstalled or was being migrated to a newer version. The trace contained these related calls:
22:36:00 open("/root/ovoinstall/HPOvPacc.xml", O_RDONLY) = 4 <0.000022>

22:36:07 open("/root/ovoinstall/OVO-Agent-Migrate.xml", O_RDONLY) = 4 <0.000032>

22:36:07 fstat(4, {st_dev=makedev(202, 1), st_ino=786799, st_mode=S_IFREG|0744, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=4240, st_size=2168841, st_atime=2016/02/03-17:04:19, st_mtime=2016/01/03-22:12:54, st_ctime=2016/01/03-22:15:32}) = 0 <0.000027>
22:36:07 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6d03f3f000 <0.000029>
22:36:07 read(4, "\355\253\356\333\3\0\0\0\0\1HPOvConf-11.11.025-1\0\0"..., 4096) = 4096 <0.000032>
22:36:07 read(4, "\"Unable to copy file\"\nOVCSL_ERRO"..., 4096) = 4096 <0.000029>

The application team found that temp was not needed but was referenced in the application's Puppet module. They disabled that resource and removed the file, which fixed the issue.
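As a preventive follow-up, a sweep like the sketch below could flag such runaway files before an agent chokes on them. The helper name and the 10G threshold are my own illustrative choices, not from the incident:

```shell
# List regular files above a size threshold under a given directory
# (the +10G threshold is an illustrative choice; errors are suppressed
# so unreadable subtrees don't clutter the output)
big_files() {
    find "$1" -xdev -type f -size +10G 2>/dev/null
}

# Example: big_files /home   -> would have flagged the 100 GB /home/qpadm/temp
```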


Wednesday, February 3, 2016

Why does vxdisk throw the error "VxVM vxdctl ERROR V-5-1-16007 Data Corruption Protection Activated - User Corrective Action Needed"?

Problem:

The following errors were encountered:

[root@customer-pet-db-3a ~]# /sbin/vxdctl enable
VxVM vxdctl ERROR V-5-1-16007  Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
        fas31400_85, fas31400_88, fas31400_92, fas31400_98, fas31400_93
        fas31400_96, fas31400_94, fas31400_95, fas31400_97, fas31400_99
        fas31402_41, fas31401_39
[root@customer-pet-db-3a ~]#


[root@customer-pet-db-3a ~]# /sbin/vxdisk scandisks
VxVM vxdisk ERROR V-5-1-16007  Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
        fas31400_85, fas31400_88, fas31400_92, fas31400_98, fas31400_93
        fas31400_96, fas31400_94, fas31400_95, fas31400_97, fas31400_99
        fas31402_41, fas31401_39
[root@customer-pet-db-3a ~]#

Solution:


1. Check the disks:

[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_85  auto:cdsdisk    myapps_pet_data00  flex3a_myapps_pet_data_dg online thinrclm
fas31400_88  auto:cdsdisk    myapps_pet_data01  flex3a_myapps_pet_data_dg online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
fas31400_92  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_93  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_94  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_95  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_96  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_97  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_98  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_99  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31401_39  auto:cdsdisk    myapps_pet_data02  flex3a_myapps_pet_data_dg online thinrclm
fas31402_41  auto:cdsdisk    myapps_pet_data03  flex3a_myapps_pet_data_dg online thinrclm
[root@customer-pet-db-3a ~]#

2. Remove all the myapp and myapps disks listed in the error message:


[root@customer-pet-db-3a ~]# for i in fas31400_85 fas31400_88 fas31400_93 fas31400_94 fas31400_95 fas31400_96 fas31400_97 fas31400_98 fas31400_99 fas31401_39 fas31402_41; do vxdisk rm $i; done
VxVM vxdisk ERROR V-5-1-534 Device fas31400_85: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31400_88: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31401_39: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31402_41: Device is in use
[root@customer-pet-db-3a ~]# vxdg list
NAME         STATE           ID
flex3a_pet_customerumprd_arch_dg enabled,cds          1446853137.45.customer-pet-db-3a.example.net
flex3a_arch_dg enabled,cds          1376078555.40.customer-pet-db-3a.example.net
flex3a_myapps_pet_data_dg enabled,cds          1375818468.64.customer-pet-db-3c.example.net
flex3a_pet_customerumprd_dg enabled,cds          1446683837.59.customer-pet-db-3b.example.net

flex3a_myapps_pet_data_dg is still imported, which is why its disks are reported as in use. Deport it first:

[root@customer-pet-db-3a ~]# /sbin/vxdg deport flex3a_myapps_pet_data_dg

[root@customer-pet-db-3a ~]# vxdg list
NAME         STATE           ID
flex3a_pet_customerumprd_arch_dg enabled,cds          1446853137.45.customer-pet-db-3a.example.net
flex3a_arch_dg enabled,cds          1376078555.40.customer-pet-db-3a.example.net
flex3a_pet_customerumprd_dg enabled,cds          1446683837.59.customer-pet-db-3b.example.net
[root@customer-pet-db-3a ~]#


Then remove the disks that were in use:

[root@customer-pet-db-3a ~]# for i in fas31400_85 fas31400_88 fas31400_93 fas31400_94 fas31400_95 fas31400_96 fas31400_97 fas31400_98 fas31400_99 fas31401_39 fas31402_41; do vxdisk rm $i; done
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_93: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_94: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_95: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_96: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_97: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_98: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_99: no such object in the configuration
[root@customer-pet-db-3a ~]#

Don't worry about the errors above: those myapp disks were already removed in the first pass, which is why VxVM no longer finds them.
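That noise can also be avoided by guarding the loop so it only removes devices VxVM still knows about. A sketch (the helper name is mine; vxdisk is assumed on PATH on a real VxVM host):

```shell
# Remove each device only if VxVM still lists it; otherwise report a skip
# instead of a "no such object in the configuration" error
rm_if_present() {
    for d in "$@"; do
        if vxdisk list "$d" >/dev/null 2>&1; then
            vxdisk rm "$d"
        else
            echo "skipping $d (already removed)"
        fi
    done
}

# Usage on the host above:
#   rm_if_present fas31400_93 fas31400_94 fas31400_95
```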

Check the disks now:

[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
[root@customer-pet-db-3a ~]#

3. Re-enable VxVM and rescan the disks:

[root@customer-pet-db-3a ~]# /sbin/vxdctl enable
[root@customer-pet-db-3a ~]# /sbin/vxdisk scandisks

No more errors were seen.

4. Import the disk groups. myapp_pet_data_dg is imported under a new name:

[root@customer-pet-db-3a ~]# /sbin/vxdg -C -n flex3a_myapp_pet_data_dg import myapp_pet_data_dg
[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_85  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31400_88  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
fas31400_92  auto:cdsdisk    myapp_pet_data00  flex3a_myapp_pet_data_dg online thinrclm
fas31400_93  auto:cdsdisk    myapp_pet_data06  flex3a_myapp_pet_data_dg online thinrclm
fas31400_94  auto:cdsdisk    myapp_pet_data03  flex3a_myapp_pet_data_dg online thinrclm
fas31400_95  auto:cdsdisk    myapp_pet_data07  flex3a_myapp_pet_data_dg online thinrclm
fas31400_96  auto:cdsdisk    myapp_pet_data05  flex3a_myapp_pet_data_dg online thinrclm
fas31400_97  auto:cdsdisk    myapp_pet_data04  flex3a_myapp_pet_data_dg online thinrclm
fas31400_98  auto:cdsdisk    myapp_pet_data01  flex3a_myapp_pet_data_dg online thinrclm
fas31400_99  auto:cdsdisk    myapp_pet_data02  flex3a_myapp_pet_data_dg online thinrclm
fas31401_39  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31401_40  auto:cdsdisk    myapp_pet_data08  flex3a_myapp_pet_data_dg online thinrclm
fas31401_41  auto:cdsdisk    myapp_pet_data11  flex3a_myapp_pet_data_dg online thinrclm
fas31401_42  auto:cdsdisk    myapp_pet_data10  flex3a_myapp_pet_data_dg online thinrclm
fas31401_43  auto:cdsdisk    myapp_pet_data09  flex3a_myapp_pet_data_dg online thinrclm
fas31402_41  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31402_45  auto:cdsdisk    myapp_pet_data14  flex3a_myapp_pet_data_dg online thinrclm
fas31402_46  auto:cdsdisk    myapp_pet_data12  flex3a_myapp_pet_data_dg online thinrclm
fas31402_47  auto:cdsdisk    myapp_pet_data15  flex3a_myapp_pet_data_dg online thinrclm
fas31402_48  auto:cdsdisk    myapp_pet_data13  flex3a_myapp_pet_data_dg online thinrclm
[root@customer-pet-db-3a ~]#

The same import for myapps returned an error:

[root@customer-pet-db-3a ~]# /sbin/vxdg -C -n flex3a_myapps_pet_data_dg import myapps_pet_data_dg
VxVM vxdg ERROR V-5-1-10978 Disk group myapps_pet_data_dg: import failed:
No valid disk found containing disk group
[root@customer-pet-db-3a ~]#

That DG had already been renamed to flex3a_myapps_pet_data_dg, so just import it directly:

[root@customer-pet-db-3a ~]# vxdg import flex3a_myapps_pet_data_dg

[root@customer-pet-db-3a ~]#


Check the disk groups now:

[root@customer-pet-db-3a ~]# vxdg list|grep -i myapp
flex3a_myapp_pet_data_dg enabled,cds          1375815375.127.customer-pet-db-3c.example.net
flex3a_myapps_pet_data_dg enabled,cds          1375818468.64.customer-pet-db-3c.example.net
[root@customer-pet-db-3a ~]#

Confirmed with the DBAs that they could start the DB instances.