Sunday, June 23, 2019

Java gateway process exited before sending its port number

Problem
While creating a Spark SQL session from a Jupyter notebook, the following error message was received:

Exception: Java gateway process exited before sending its port number

Environment

  • OS - RHEL 7
  • Jupyter notebook


Steps to reproduce

  • Configure Jupyter notebook
  • Start Jupyter
  • Access the Jupyter web page
  • Run a program in the notebook like the one below


from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession


  • Now click on "Run"; the exception above is thrown (a minimal reproducing cell is shown below).
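For reference, here is a minimal cell that actually triggers the gateway launch (the app name is just an illustrative placeholder); the exception is raised when the session is built:

from pyspark.sql import SparkSession

# Building the session launches the Java gateway; this is the point where
# "Java gateway process exited before sending its port number" appears.
spark = SparkSession.builder.appName("example").getOrCreate()
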
Solution
Add the following to .bashrc and restart the Jupyter notebook server:

#For jupyter notebook:
export PYSPARK_SUBMIT_ARGS="--master yarn-client pyspark-shell"
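
If editing .bashrc is not convenient, the same variable can also be set from inside the notebook, provided it happens before any Spark session is created (a minimal sketch; the --master value must match your cluster and Spark version):

import os

# Must be set before SparkSession/SparkContext is created, because
# PYSPARK_SUBMIT_ARGS is read only when the Java gateway is launched.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master yarn-client pyspark-shell"
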
Root Cause Analysis
If you look at the code below (an excerpt from PySpark's gateway-launch logic), the exception is raised when the connection-info file is never created in the temporary directory. In my case, the environment variable "PYSPARK_SUBMIT_ARGS" was not set up, so the gateway process ("proc") did not come up correctly and the file never appeared.

submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
...
command = command + shlex.split(submit_args)
...
proc = Popen(command, **popen_kwargs)

...
# Wait for the file to appear, or for the process to exit, whichever happens first.
while not proc.poll() and not os.path.isfile(conn_info_file):
    time.sleep(0.1)

if not os.path.isfile(conn_info_file):
    raise Exception("Java gateway process exited before sending its port number")
...
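
A quick sanity check from a notebook cell is to confirm which submit arguments the kernel will actually hand to the gateway:

import os

# If this prints None, the code above falls back to plain "pyspark-shell".
print(os.environ.get("PYSPARK_SUBMIT_ARGS"))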

Wednesday, November 8, 2017

Docker build encounters error "failed to unshare namespaces: Invalid argument"

Issue



nsenter: failed to unshare namespaces: Invalid argument
container_linux.go:265: starting container process caused "process_linux.go:270: running exec setns process for init caused \"exit status 34\""
oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:270: running exec setns process for init caused \"exit status 34\""

Environment

  RHEL 7.2

Solution


Execute the commands below as the root user and reboot the system.

grubby --args="user_namespace.enable=1" --update-kernel="$(grubby --default-kernel)"
reboot
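
After the reboot, you can confirm that the parameter is active on the running kernel, for example:

grep user_namespace.enable /proc/cmdline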

Root cause


User namespaces were enabled at the Docker daemon level (via --userns-remap), but were not enabled at the kernel level.

The Docker daemon was started like this:

ExecStart=/usr/bin/dockerd --live-restore --userland-proxy=false --icc=false --disable-legacy-registry  --userns-remap=docker:docker --tlsverify --tlscacert=/var/SP/docker/certs/ca.pem --tlscert=/var/SP/docker/certs/cert.pem --tlskey=/var/SP/docker/certs/key.pem -g /var/SP/docker -H tcp://0.0.0.0:2376  -H unix://var/run/docker.sock --log-driver=syslog --log-opt syslog-address=tcp://localhost:514 --log-opt tag="DOCKER"
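
Note that for --userns-remap=docker:docker to work, the docker user and group also need subordinate ID ranges; entries along these lines (the values here are illustrative) are expected in /etc/subuid and /etc/subgid:

docker:100000:65536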


Docker build encounters "operation timed out"

Issue

While running docker build, we see the following error:
fetch http://dl-4.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz: operation timed out
fetch http://dl-4.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz: operation timed out
ERROR: unsatisfiable constraints:

Solution 

Pass the proxy as an environment variable and unset it at the end of the Dockerfile:
ENV http_proxy http://proxyhost.com:3128
ENV https_proxy http://proxyhost.com:3128
...
RUN unset http_proxy https_proxy
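
Note that ENV values are baked into the image layers and a RUN unset only affects that single RUN shell. On Docker versions that support build arguments, it may be cleaner to pass the proxy only at build time instead (a sketch; replace the proxy host with your own):

docker build --build-arg http_proxy=http://proxyhost.com:3128 \
             --build-arg https_proxy=http://proxyhost.com:3128 .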

Root cause Analysis

  1. A docker build on a machine that sits behind a proxy server can hit this issue. The RUN steps execute inside the intermediate build container, which does not inherit the host's proxy settings; hence you need to set the proxy in the Dockerfile and unset it later.
  2. A second, rarer possibility is that the upstream mirror is unavailable. You can view the currently configured repositories with the command below:
RUN cat /etc/apk/repositories
The default values are:
http://dl-cdn.alpinelinux.org/alpine/v3.4/main
http://dl-cdn.alpinelinux.org/alpine/v3.4/community

You can check in a browser whether the domain resolves. If it does not, you can switch the repositories to a reachable mirror, like below:
RUN sed -i 's/dl-cdn.alpinelinux.org/dl-4.alpinelinux.org/' /etc/apk/repositories




Wednesday, February 10, 2016

Why was Puppet using 96% CPU when run as a daemon or from the command prompt?

Problem: Puppet was consuming 96% CPU, whether it ran as a daemon or was run as "puppet agent -t".
Environment: RHEL 6
Solution: Make sure the application's Puppet module is tested and does not contain errors or unwanted resources.
Root Cause Analysis: 

top showed the following:


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11403 root 20 0 259m 140m 4516 R 96.7 1.9 9:08.77 puppet
1. Traced the puppet agent from the command prompt, with the daemon kept off (-T prints the time spent in each system call):
$strace -Tvv  puppet agent -t  &> ./strace_log

2. read() system calls were being issued continuously, printing messages like the following:
[...]
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000031>
[...]
3. While tracking the filename behind that file descriptor, I found that /home/qpadm/temp had been opened, and it was roughly 100 GiB (107,374,182,400 bytes, per the fstat() call below) in size.
4. read() was continuously pulling data from this file, 4096 bytes per call. The trace looked like this:
22:36:37 open("/home/qpadm/temp", O_RDONLY) = 4 <0 .000032="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000026="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000026="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000027="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000026="">
22:36:37 fstat(4, {st_dev=makedev(202, 1), st_ino=790228, st_mode=S_IFREG|0600, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=209715224, st_size=107374182400, st_atime=2016/02/03-07:37:15, st_mtime=2016/02/03-05:11:57, st_ctime=2016/02/03-07:36:48}) = 0 <0 .000027="">
22:36:37 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6d03f3f000 <0 .000030="">
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0 .000035="">
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0 .000031="">
In the trace above, you can see that /home/qpadm/temp is opened read-only and assigned file descriptor 4, and that the same FD is passed to read(), which returns 4096 bytes per call. Reading a ~100 GiB file this way requires an enormous number of read() calls and would effectively never finish, which is why the puppet agent kept the CPU pegged.
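To put that in perspective, here is a rough back-of-the-envelope count of the read() calls needed, using the sizes from the trace above:

file_size = 107374182400    # st_size reported by fstat() above (100 GiB)
read_size = 4096            # bytes returned by each read() call

print(file_size // read_size)   # 26214400 read() system calls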
We needed to check why this file existed, what it was used for, and why it was so large. This was the main culprit.
Another thing I noticed is that OVO (the HP OpenView Operations agent) had either been reinstalled or was being migrated to a newer version. In relation to this, I found the following calls:
22:36:00 open("/root/ovoinstall/HPOvPacc.xml", O_RDONLY) = 4 <0 .000022="">

22:36:07 open("/root/ovoinstall/OVO-Agent-Migrate.xml", O_RDONLY) = 4 <0 .000032="">

22:36:07 fstat(4, {st_dev=makedev(202, 1), st_ino=786799, st_mode=S_IFREG|0744, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=4240, st_size=2168841, st_atime=2016/02/03-17:04:19, st_mtime=2016/01/03-22:12:54, st_ctime=2016/01/03-22:15:32}) = 0 <0 .000027="">
22:36:07 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6d03f3f000 <0 .000029="">
22:36:07 read(4, "\355\253\356\333\3\0\0\0\0\1HPOvConf-11.11.025-1\0\0"..., 4096) = 4096 <0 .000032="">
22:36:07 read(4, "\"Unable to copy file\"\nOVCSL_ERRO"..., 4096) = 4096 <0 .000029="">

The application team found that the temp file was not needed but was referenced in the application's Puppet module. They disabled that part of the module and removed the file, which fixed the issue.


Wednesday, February 3, 2016

Why do vxdctl and vxdisk throw the error "VxVM ERROR V-5-1-16007 Data Corruption Protection Activated - User Corrective Action Needed"?

Problem:

The following errors were encountered:

[root@customer-pet-db-3a ~]# /sbin/vxdctl enable
VxVM vxdctl ERROR V-5-1-16007  Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
        fas31400_85, fas31400_88, fas31400_92, fas31400_98, fas31400_93
        fas31400_96, fas31400_94, fas31400_95, fas31400_97, fas31400_99
        fas31402_41, fas31401_39
[root@customer-pet-db-3a ~]#


[root@customer-pet-db-3a ~]# /sbin/vxdisk scandisks
VxVM vxdisk ERROR V-5-1-16007  Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
        fas31400_85, fas31400_88, fas31400_92, fas31400_98, fas31400_93
        fas31400_96, fas31400_94, fas31400_95, fas31400_97, fas31400_99
        fas31402_41, fas31401_39
[root@customer-pet-db-3a ~]#

Solution applied:


1. Check the disks:

[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_85  auto:cdsdisk    myapps_pet_data00  flex3a_myapps_pet_data_dg online thinrclm
fas31400_88  auto:cdsdisk    myapps_pet_data01  flex3a_myapps_pet_data_dg online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
fas31400_92  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_93  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_94  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_95  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_96  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_97  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_98  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_99  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31401_39  auto:cdsdisk    myapps_pet_data02  flex3a_myapps_pet_data_dg online thinrclm
fas31402_41  auto:cdsdisk    myapps_pet_data03  flex3a_myapps_pet_data_dg online thinrclm
[root@customer-pet-db-3a ~]#

2. Removed all of the myapp and myapps disks listed in the error message above:


[root@customer-pet-db-3a ~]# for i in fas31400_85 fas31400_88 fas31400_93 fas31400_94 fas31400_95 fas31400_96 fas31400_97 fas31400_98 fas31400_99 fas31401_39 fas31402_41; do vxdisk rm $i; done
VxVM vxdisk ERROR V-5-1-534 Device fas31400_85: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31400_88: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31401_39: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31402_41: Device is in use
[root@customer-pet-db-3a ~]# vxdg list
NAME         STATE           ID
flex3a_pet_customerumprd_arch_dg enabled,cds          1446853137.45.customer-pet-db-3a.example.net
flex3a_arch_dg enabled,cds          1376078555.40.customer-pet-db-3a.example.net
flex3a_myapps_pet_data_dg enabled,cds          1375818468.64.customer-pet-db-3c.example.net
flex3a_pet_customerumprd_dg enabled,cds          1446683837.59.customer-pet-db-3b.example.net

flex3a_myapps_pet_data_dg is still in use, so it needs to be deported first.

[root@customer-pet-db-3a ~]# /sbin/vxdg deport flex3a_myapps_pet_data_dg

[root@customer-pet-db-3a ~]# vxdg list
NAME         STATE           ID
flex3a_pet_customerumprd_arch_dg enabled,cds          1446853137.45.customer-pet-db-3a.example.net
flex3a_arch_dg enabled,cds          1376078555.40.customer-pet-db-3a.example.net
flex3a_pet_customerumprd_dg enabled,cds          1446683837.59.customer-pet-db-3b.example.net
[root@customer-pet-db-3a ~]#


Remove the disks that were in use:

[root@customer-pet-db-3a ~]# for i in fas31400_85 fas31400_88 fas31400_93 fas31400_94 fas31400_95 fas31400_96 fas31400_97 fas31400_98 fas31400_99 fas31401_39 fas31402_41; do vxdisk rm $i; done
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_93: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_94: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_95: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_96: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_97: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_98: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_99: no such object in the configuration
[root@customer-pet-db-3a ~]#

Don't worry about the errors above; the myapp disks were already removed in the first pass, which is why these "no such object" messages appear.

Check the disks now:

[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
[root@customer-pet-db-3a ~]#

3. Enable and scan the disks again:

[root@customer-pet-db-3a ~]# /sbin/vxdctl enable
[root@customer-pet-db-3a ~]# /sbin/vxdisk scandisks

No more errors were seen.

4. Import the disk group now (-C clears the host import locks and -n imports it under a new name):

[root@customer-pet-db-3a ~]# /sbin/vxdg -C -n flex3a_myapp_pet_data_dg import myapp_pet_data_dg
[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_85  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31400_88  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
fas31400_92  auto:cdsdisk    myapp_pet_data00  flex3a_myapp_pet_data_dg online thinrclm
fas31400_93  auto:cdsdisk    myapp_pet_data06  flex3a_myapp_pet_data_dg online thinrclm
fas31400_94  auto:cdsdisk    myapp_pet_data03  flex3a_myapp_pet_data_dg online thinrclm
fas31400_95  auto:cdsdisk    myapp_pet_data07  flex3a_myapp_pet_data_dg online thinrclm
fas31400_96  auto:cdsdisk    myapp_pet_data05  flex3a_myapp_pet_data_dg online thinrclm
fas31400_97  auto:cdsdisk    myapp_pet_data04  flex3a_myapp_pet_data_dg online thinrclm
fas31400_98  auto:cdsdisk    myapp_pet_data01  flex3a_myapp_pet_data_dg online thinrclm
fas31400_99  auto:cdsdisk    myapp_pet_data02  flex3a_myapp_pet_data_dg online thinrclm
fas31401_39  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31401_40  auto:cdsdisk    myapp_pet_data08  flex3a_myapp_pet_data_dg online thinrclm
fas31401_41  auto:cdsdisk    myapp_pet_data11  flex3a_myapp_pet_data_dg online thinrclm
fas31401_42  auto:cdsdisk    myapp_pet_data10  flex3a_myapp_pet_data_dg online thinrclm
fas31401_43  auto:cdsdisk    myapp_pet_data09  flex3a_myapp_pet_data_dg online thinrclm
fas31402_41  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31402_45  auto:cdsdisk    myapp_pet_data14  flex3a_myapp_pet_data_dg online thinrclm
fas31402_46  auto:cdsdisk    myapp_pet_data12  flex3a_myapp_pet_data_dg online thinrclm
fas31402_47  auto:cdsdisk    myapp_pet_data15  flex3a_myapp_pet_data_dg online thinrclm
fas31402_48  auto:cdsdisk    myapp_pet_data13  flex3a_myapp_pet_data_dg online thinrclm
[root@customer-pet-db-3a ~]#

Received an error for myapps:

[root@customer-pet-db-3a ~]# /sbin/vxdg -C -n flex3a_myapps_pet_data_dg import myapps_pet_data_dg
VxVM vxdg ERROR V-5-1-10978 Disk group myapps_pet_data_dg: import failed:
No valid disk found containing disk group
[root@customer-pet-db-3a ~]#

This DG had already been renamed to flex3a_myapps_pet_data_dg (see the vxdg list output above), so it just needs a plain import:

[root@customer-pet-db-3a ~]# vxdg import flex3a_myapps_pet_data_dg

[root@customer-pet-db-3a ~]#


Check the disk groups now:

[root@customer-pet-db-3a ~]# vxdg list|grep -i myapp
flex3a_myapp_pet_data_dg enabled,cds          1375815375.127.customer-pet-db-3c.example.net
flex3a_myapps_pet_data_dg enabled,cds          1375818468.64.customer-pet-db-3c.example.net
[root@customer-pet-db-3a ~]#

Confirmed with the DBAs that they could start the DB instances.
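
If volumes in a freshly imported disk group do not start on their own, they can usually be started manually (disk group name taken from above; adjust for your configuration):

vxvol -g flex3a_myapp_pet_data_dg startall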


Thursday, December 24, 2015

Why did kinit throw the error "Preauthentication failed while getting initial credentials"?

Problem: 
While executing the following command:

#kinit -k -t /root/utilscripts/nsupdateuser.keytab nsupdate@example.com

it threw the error:

kinit: Preauthentication failed while getting initial credentials

Solution: The user's password may be wrong, or the keytab may no longer match it. Try resetting the password, regenerating the keytab if needed, and testing again.
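
A quick way to see which principals and key version numbers the keytab actually contains (path taken from the command above):

klist -kt /root/utilscripts/nsupdateuser.keytab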

Why did kinit throw the error "KDC reply did not match expectations while getting initial credentials"?

Problem: 

While executing the following command:

#kinit username@MYDOMAIN.COM -k -t username.keytab

it threw the error:

kinit: KDC reply did not match expectations while getting initial credentials

Solution:

In this case, the user did not have remote access to the machine.