Sunday, June 23, 2019

Java gateway process exited before sending its port number

Problem
While creating a Spark SQL session from a Jupyter notebook, the following error message was received:

Exception: Java gateway process exited before sending its port number

Environment

  • OS - RHEL 7
  • Jupyter notebook


Steps to reproduce

  • Configure Jupyter notebook
  • Start Jupyter
  • Access the Jupyter web page
  • Run a program in the notebook like the one below


from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession


  • Now click on "Run"; the exception above is thrown (a minimal reproducing cell is shown below).
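For reference, here is a minimal cell that actually triggers the gateway launch (the app name is just an illustrative placeholder); the exception is raised when the session is built:

from pyspark.sql import SparkSession

# Building the session launches the Java gateway; this is the point where
# "Java gateway process exited before sending its port number" appears.
spark = SparkSession.builder.appName("example").getOrCreate()
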
Solution
Add the following to .bashrc and restart the Jupyter notebook server:

#For jupyter notebook:
export PYSPARK_SUBMIT_ARGS="--master yarn-client pyspark-shell"
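
If editing .bashrc is not convenient, the same variable can also be set from inside the notebook, provided it happens before any Spark session is created (a minimal sketch; the --master value must match your cluster and Spark version):

import os

# Must be set before SparkSession/SparkContext is created, because
# PYSPARK_SUBMIT_ARGS is read only when the Java gateway is launched.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master yarn-client pyspark-shell"
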
Root Cause Analysis
If you look at the code below (an excerpt from PySpark's gateway-launch logic), the exception is raised when the connection-info file is never created in the temporary directory. In my case, the environment variable "PYSPARK_SUBMIT_ARGS" was not set up, so the gateway process ("proc") did not come up correctly and the file never appeared.

submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
...
command = command + shlex.split(submit_args)
...
proc = Popen(command, **popen_kwargs)

...
# Wait for the file to appear, or for the process to exit, whichever happens first.
while not proc.poll() and not os.path.isfile(conn_info_file):
    time.sleep(0.1)

if not os.path.isfile(conn_info_file):
    raise Exception("Java gateway process exited before sending its port number")
...
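
A quick sanity check from a notebook cell is to confirm which submit arguments the kernel will actually hand to the gateway:

import os

# If this prints None, the code above falls back to plain "pyspark-shell".
print(os.environ.get("PYSPARK_SUBMIT_ARGS"))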

Wednesday, November 8, 2017

Docker build encounters error "failed to unshare namespaces: Invalid argument"

Issue



nsenter: failed to unshare namespaces: Invalid argument
container_linux.go:265: starting container process caused "process_linux.go:270: running exec setns process for init caused \"exit status 34\""
oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:270: running exec setns process for init caused \"exit status 34\""

Environment

  RHEL 7.2

Solution


Execute the commands below as the root user and reboot the system.

grubby --args="user_namespace.enable=1" --update-kernel="$(grubby --default-kernel)"
reboot
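
After the reboot, you can confirm that the parameter is active on the running kernel, for example:

grep user_namespace.enable /proc/cmdline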

Root cause


User namespaces were enabled at the Docker daemon level (via --userns-remap), but were not enabled at the kernel level.

The Docker daemon was started like this:

ExecStart=/usr/bin/dockerd --live-restore --userland-proxy=false --icc=false --disable-legacy-registry  --userns-remap=docker:docker --tlsverify --tlscacert=/var/SP/docker/certs/ca.pem --tlscert=/var/SP/docker/certs/cert.pem --tlskey=/var/SP/docker/certs/key.pem -g /var/SP/docker -H tcp://0.0.0.0:2376  -H unix://var/run/docker.sock --log-driver=syslog --log-opt syslog-address=tcp://localhost:514 --log-opt tag="DOCKER"
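
Note that for --userns-remap=docker:docker to work, the docker user and group also need subordinate ID ranges; entries along these lines (the values here are illustrative) are expected in /etc/subuid and /etc/subgid:

docker:100000:65536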


Docker build encounters "operation timed out"

Issue

While running docker build, we see the following error:
fetch http://dl-4.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz: operation timed out
fetch http://dl-4.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz: operation timed out
ERROR: unsatisfiable constraints:

Solution 

Pass the proxy as an environment variable and unset it at the end of the Dockerfile:
ENV http_proxy http://proxyhost.com:3128
ENV https_proxy http://proxyhost.com:3128
...
RUN unset http_proxy https_proxy
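
Note that ENV values are baked into the image layers and a RUN unset only affects that single RUN shell. On Docker versions that support build arguments, it may be cleaner to pass the proxy only at build time instead (a sketch; replace the proxy host with your own):

docker build --build-arg http_proxy=http://proxyhost.com:3128 \
             --build-arg https_proxy=http://proxyhost.com:3128 .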

Root cause Analysis

  1. A docker build on a machine that sits behind a proxy server can hit this issue. The RUN steps execute inside the intermediate build container, which does not inherit the host's proxy settings; hence you need to set the proxy in the Dockerfile and unset it later.
  2. A second, rarer possibility is that the upstream mirror is unavailable. You can view the currently configured repositories with the command below:
RUN cat /etc/apk/repositories
The default values are:
http://dl-cdn.alpinelinux.org/alpine/v3.4/main
http://dl-cdn.alpinelinux.org/alpine/v3.4/community

You can check in a browser whether the domain resolves. If it does not, you can switch the repositories to a reachable mirror, like below:
RUN sed -i 's/dl-cdn.alpinelinux.org/dl-4.alpinelinux.org/' /etc/apk/repositories




Wednesday, February 10, 2016

Why was Puppet using 96% CPU when run as a daemon or from the command prompt?

Problem: Puppet was consuming 96% CPU, whether it ran as a daemon or was run as "puppet agent -t".
Environment: RHEL 6
Solution: Make sure the application's Puppet module is tested and does not contain errors or unwanted resources.
Root Cause Analysis: 

top showed the following:


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11403 root 20 0 259m 140m 4516 R 96.7 1.9 9:08.77 puppet
1. Traced the puppet agent from the command prompt, with the daemon kept off (-T prints the time spent in each system call):
$strace -Tvv  puppet agent -t  &> ./strace_log

2. read() system calls were being issued continuously, printing messages like the following:
[...]
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000031>
[...]
3. While tracking the filename behind that file descriptor, I found that /home/qpadm/temp had been opened, and it was roughly 100 GiB (107,374,182,400 bytes, per the fstat() call below) in size.
4. read() was continuously pulling data from this file, 4096 bytes per call. The trace looked like this:
22:36:37 open("/home/qpadm/temp", O_RDONLY) = 4 <0 .000032="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000026="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000026="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000027="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000026="">
22:36:37 fstat(4, {st_dev=makedev(202, 1), st_ino=790228, st_mode=S_IFREG|0600, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=209715224, st_size=107374182400, st_atime=2016/02/03-07:37:15, st_mtime=2016/02/03-05:11:57, st_ctime=2016/02/03-07:36:48}) = 0 <0 .000027="">
22:36:37 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6d03f3f000 <0 .000030="">
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0 .000035="">
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0 .000031="">
In the trace above, you can see that /home/qpadm/temp is opened read-only and assigned file descriptor 4, and that the same FD is passed to read(), which returns 4096 bytes per call. Reading a ~100 GiB file this way requires an enormous number of read() calls and would effectively never finish, which is why the puppet agent kept the CPU pegged.
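To put that in perspective, here is a rough back-of-the-envelope count of the read() calls needed, using the sizes from the trace above:

file_size = 107374182400    # st_size reported by fstat() above (100 GiB)
read_size = 4096            # bytes returned by each read() call

print(file_size // read_size)   # 26214400 read() system calls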
We needed to check why this file existed, what it was used for, and why it was so large. This was the main culprit.
Another thing I noticed is that OVO (the HP OpenView Operations agent) had either been reinstalled or was being migrated to a newer version. In relation to this, I found the following calls:
22:36:00 open("/root/ovoinstall/HPOvPacc.xml", O_RDONLY) = 4 <0 .000022="">

22:36:07 open("/root/ovoinstall/OVO-Agent-Migrate.xml", O_RDONLY) = 4 <0 .000032="">

22:36:07 fstat(4, {st_dev=makedev(202, 1), st_ino=786799, st_mode=S_IFREG|0744, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=4240, st_size=2168841, st_atime=2016/02/03-17:04:19, st_mtime=2016/01/03-22:12:54, st_ctime=2016/01/03-22:15:32}) = 0 <0 .000027="">
22:36:07 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6d03f3f000 <0 .000029="">
22:36:07 read(4, "\355\253\356\333\3\0\0\0\0\1HPOvConf-11.11.025-1\0\0"..., 4096) = 4096 <0 .000032="">
22:36:07 read(4, "\"Unable to copy file\"\nOVCSL_ERRO"..., 4096) = 4096 <0 .000029="">

The application team found that the temp file was not needed but was referenced in the application's Puppet module. They disabled that part of the module and removed the file, which fixed the issue.


Wednesday, February 3, 2016

Why do vxdctl and vxdisk throw the error "VxVM ERROR V-5-1-16007 Data Corruption Protection Activated - User Corrective Action Needed"?

Problem:

The following errors were encountered:

[root@customer-pet-db-3a ~]# /sbin/vxdctl enable
VxVM vxdctl ERROR V-5-1-16007  Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
        fas31400_85, fas31400_88, fas31400_92, fas31400_98, fas31400_93
        fas31400_96, fas31400_94, fas31400_95, fas31400_97, fas31400_99
        fas31402_41, fas31401_39
[root@customer-pet-db-3a ~]#


[root@customer-pet-db-3a ~]# /sbin/vxdisk scandisks
VxVM vxdisk ERROR V-5-1-16007  Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
        fas31400_85, fas31400_88, fas31400_92, fas31400_98, fas31400_93
        fas31400_96, fas31400_94, fas31400_95, fas31400_97, fas31400_99
        fas31402_41, fas31401_39
[root@customer-pet-db-3a ~]#

Solution applied:


1. Check the disks:

[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_85  auto:cdsdisk    myapps_pet_data00  flex3a_myapps_pet_data_dg online thinrclm
fas31400_88  auto:cdsdisk    myapps_pet_data01  flex3a_myapps_pet_data_dg online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
fas31400_92  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_93  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_94  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_95  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_96  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_97  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_98  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_99  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31401_39  auto:cdsdisk    myapps_pet_data02  flex3a_myapps_pet_data_dg online thinrclm
fas31402_41  auto:cdsdisk    myapps_pet_data03  flex3a_myapps_pet_data_dg online thinrclm
[root@customer-pet-db-3a ~]#

2. Removed all of the myapp and myapps disks listed in the error message above:


[root@customer-pet-db-3a ~]# for i in fas31400_85 fas31400_88 fas31400_93 fas31400_94 fas31400_95 fas31400_96 fas31400_97 fas31400_98 fas31400_99 fas31401_39 fas31402_41; do vxdisk rm $i; done
VxVM vxdisk ERROR V-5-1-534 Device fas31400_85: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31400_88: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31401_39: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31402_41: Device is in use
[root@customer-pet-db-3a ~]# vxdg list
NAME         STATE           ID
flex3a_pet_customerumprd_arch_dg enabled,cds          1446853137.45.customer-pet-db-3a.example.net
flex3a_arch_dg enabled,cds          1376078555.40.customer-pet-db-3a.example.net
flex3a_myapps_pet_data_dg enabled,cds          1375818468.64.customer-pet-db-3c.example.net
flex3a_pet_customerumprd_dg enabled,cds          1446683837.59.customer-pet-db-3b.example.net

flex3a_myapps_pet_data_dg is still in use, so it needs to be deported first.

[root@customer-pet-db-3a ~]# /sbin/vxdg deport flex3a_myapps_pet_data_dg

[root@customer-pet-db-3a ~]# vxdg list
NAME         STATE           ID
flex3a_pet_customerumprd_arch_dg enabled,cds          1446853137.45.customer-pet-db-3a.example.net
flex3a_arch_dg enabled,cds          1376078555.40.customer-pet-db-3a.example.net
flex3a_pet_customerumprd_dg enabled,cds          1446683837.59.customer-pet-db-3b.example.net
[root@customer-pet-db-3a ~]#


Remove the disks that were in use:

[root@customer-pet-db-3a ~]# for i in fas31400_85 fas31400_88 fas31400_93 fas31400_94 fas31400_95 fas31400_96 fas31400_97 fas31400_98 fas31400_99 fas31401_39 fas31402_41; do vxdisk rm $i; done
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_93: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_94: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_95: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_96: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_97: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_98: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_99: no such object in the configuration
[root@customer-pet-db-3a ~]#

Don't worry about the errors above; the myapp disks were already removed in the first pass, which is why these "no such object" messages appear.

Check the disks now:

[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
[root@customer-pet-db-3a ~]#

3. Enable and scan the disks again:

[root@customer-pet-db-3a ~]# /sbin/vxdctl enable
[root@customer-pet-db-3a ~]# /sbin/vxdisk scandisks

No more errors were seen.

4. Import the disk group now (-C clears the host import locks and -n imports it under a new name):

[root@customer-pet-db-3a ~]# /sbin/vxdg -C -n flex3a_myapp_pet_data_dg import myapp_pet_data_dg
[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_85  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31400_88  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
fas31400_92  auto:cdsdisk    myapp_pet_data00  flex3a_myapp_pet_data_dg online thinrclm
fas31400_93  auto:cdsdisk    myapp_pet_data06  flex3a_myapp_pet_data_dg online thinrclm
fas31400_94  auto:cdsdisk    myapp_pet_data03  flex3a_myapp_pet_data_dg online thinrclm
fas31400_95  auto:cdsdisk    myapp_pet_data07  flex3a_myapp_pet_data_dg online thinrclm
fas31400_96  auto:cdsdisk    myapp_pet_data05  flex3a_myapp_pet_data_dg online thinrclm
fas31400_97  auto:cdsdisk    myapp_pet_data04  flex3a_myapp_pet_data_dg online thinrclm
fas31400_98  auto:cdsdisk    myapp_pet_data01  flex3a_myapp_pet_data_dg online thinrclm
fas31400_99  auto:cdsdisk    myapp_pet_data02  flex3a_myapp_pet_data_dg online thinrclm
fas31401_39  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31401_40  auto:cdsdisk    myapp_pet_data08  flex3a_myapp_pet_data_dg online thinrclm
fas31401_41  auto:cdsdisk    myapp_pet_data11  flex3a_myapp_pet_data_dg online thinrclm
fas31401_42  auto:cdsdisk    myapp_pet_data10  flex3a_myapp_pet_data_dg online thinrclm
fas31401_43  auto:cdsdisk    myapp_pet_data09  flex3a_myapp_pet_data_dg online thinrclm
fas31402_41  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31402_45  auto:cdsdisk    myapp_pet_data14  flex3a_myapp_pet_data_dg online thinrclm
fas31402_46  auto:cdsdisk    myapp_pet_data12  flex3a_myapp_pet_data_dg online thinrclm
fas31402_47  auto:cdsdisk    myapp_pet_data15  flex3a_myapp_pet_data_dg online thinrclm
fas31402_48  auto:cdsdisk    myapp_pet_data13  flex3a_myapp_pet_data_dg online thinrclm
[root@customer-pet-db-3a ~]#

Received an error for myapps:

[root@customer-pet-db-3a ~]# /sbin/vxdg -C -n flex3a_myapps_pet_data_dg import myapps_pet_data_dg
VxVM vxdg ERROR V-5-1-10978 Disk group myapps_pet_data_dg: import failed:
No valid disk found containing disk group
[root@customer-pet-db-3a ~]#

This DG had already been renamed to flex3a_myapps_pet_data_dg (see the vxdg list output above), so it just needs a plain import:

[root@customer-pet-db-3a ~]# vxdg import flex3a_myapps_pet_data_dg

[root@customer-pet-db-3a ~]#


Check the disk groups now:

[root@customer-pet-db-3a ~]# vxdg list|grep -i myapp
flex3a_myapp_pet_data_dg enabled,cds          1375815375.127.customer-pet-db-3c.example.net
flex3a_myapps_pet_data_dg enabled,cds          1375818468.64.customer-pet-db-3c.example.net
[root@customer-pet-db-3a ~]#

Confirmed with the DBAs that they could start the DB instances.
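
If volumes in a freshly imported disk group do not start on their own, they can usually be started manually (disk group name taken from above; adjust for your configuration):

vxvol -g flex3a_myapp_pet_data_dg startall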


Thursday, December 24, 2015

Why did kinit throw the error "Preauthentication failed while getting initial credentials"?

Problem: 
While executing the following command:

#kinit -k -t /root/utilscripts/nsupdateuser.keytab nsupdate@example.com

it threw the error:

kinit: Preauthentication failed while getting initial credentials

Solution: The user's password may be wrong, or the keytab may no longer match it. Try resetting the password, regenerating the keytab if needed, and testing again.
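
A quick way to see which principals and key version numbers the keytab actually contains (path taken from the command above):

klist -kt /root/utilscripts/nsupdateuser.keytab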

Why did kinit throw the error "KDC reply did not match expectations while getting initial credentials"?

Problem: 

While executing the following command:

#kinit username@MYDOMAIN.COM -k -t username.keytab

it threw the error:

kinit: KDC reply did not match expectations while getting initial credentials

Solution:

In this case, the user did not have remote access to the machine.