Troubleshooting

The following links should be useful:

Other problems, most of them undocumented (or perhaps non-standard problems arising from our site configuration), are described below.


(Please note that the hostnames and paths are specific to our site.)

(Please DO NOT follow these instructions blindly. You should know what you are doing.)


SSH BETWEEN WNs AND CEs

DOUBLE CHECK that the WNs and CE passwordless configuration is working!

See Passwordless SSH section in Grid Administration Guide.


Job aborted with "Status Reason: Cannot plan: BrokerHelper: Problems querying the information service"

Check that the BDII services are running. Also check that the LDAP server on port 2135 (globus-mds) on the CE is reachable from the RB. The error message is actually misleading: the RB is trying to contact the CE, not the BDII.

Related: the section below about contacting the LDAP server on a private/public network.


Bad UID for job execution

This error is present on the client when submitting a job:

[imainsgm@i2gce01 imainsgm]$ echo "sleep 60"|qsub -q imain -u imainsgm
qsub: Bad UID for job execution

Or in the server logs with the error 15023:

08/29/2007 18:35:57;0080;PBS_Server;Req;req_reject;Reject reply code=15023(Bad UID for job execution), aux=0, type=QueueJob, from imainsgm@10.10.0.21

The solution is to add the host (in our case 10.10.0.21) to /etc/hosts.equiv.
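
The fix above can be applied more than once without duplicating the entry. The following is a minimal sketch, not the procedure we actually used: the function name add_trusted_host is made up for illustration, and the file defaults to /etc/hosts.equiv only because that is the file named above.

```shell
# add_trusted_host HOST [FILE]
# Appends HOST to FILE unless an identical line is already present.
# add_trusted_host is a hypothetical helper name; FILE defaults to
# /etc/hosts.equiv, the file mentioned in the text above.
add_trusted_host() {
    host=$1
    file=${2:-/etc/hosts.equiv}
    # -x matches the whole line, -F treats the host as a fixed string
    if grep -qxF -- "$host" "$file" 2>/dev/null; then
        echo "$host already listed in $file"
    else
        echo "$host" >> "$file"
        echo "$host added to $file"
    fi
}
```

For example, "add_trusted_host 10.10.0.21" run as root would add our submitting host. Try it on a copy of the file first.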

GRAM Job submission failed because the provided RSL 'queue' parameter is invalid (error code 37)

Check /opt/globus/share/globus_gram_job_manager/pbs.rvf and /opt/globus/share/globus_gram_job_manager/lcgpbs.rvf for the correct queue values.


TODO: Check whether yaim does this correctly.

Cannot contact the LDAP server on private/public network

If you are configuring your grid site with private/public network addresses and you cannot contact globus-mds on either the private or the public address, you need to apply this patch to /etc/init.d/globus-mds:

--- /etc/init.d/globus-mds	2007-08-24 15:52:04.000000000 +0200
+++ globus-mds	2007-08-24 15:52:40.000000000 +0200
@@ -178,7 +178,8 @@
 
         if [ -z "$GLOBUS_HOSTNAME" ]; then
           if [ -x ${GLOBUS_LOCATION}/bin/globus-hostname ]; then
-            GLOBUS_HOSTNAME=`${GLOBUS_LOCATION}/bin/globus-hostname`
+#            GLOBUS_HOSTNAME=`${GLOBUS_LOCATION}/bin/globus-hostname`
+            GLOBUS_HOSTNAME=0.0.0.0
           else
             echo
             echo "Missing ${GLOBUS_LOCATION}/bin/globus-hostname - Is globus_common_config installed?"

MonBox => Not publishing accounting data

This happens because our MON host (egeemon01) asks the CE (egeece01) for the accounting data instead of querying the site-BDII host (egeeiis01). The MON believes that the CE is also the site-BDII.

The problem is resolved by editing /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml so that the GIIS element points to the site-BDII host:

<CPUProcessor>
        <GIIS>egeeiis01.ifca.es</GIIS>
</CPUProcessor>

Another workaround is to launch the BDII on the CE, then run the APEL parser and publisher. On the CE execute:

# service bdii start
# env RGMA_HOME=/opt/glite APEL_HOME=/opt/glite /opt/glite/bin/apel-pbs-log-parser \
  -f /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml
# service bdii stop

Then, on the MON we are now able to publish data to the GOC:

# env RGMA_HOME=/opt/glite APEL_HOME=/opt/glite /opt/glite/bin/apel-publisher \
  -f /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml >> /var/log/apel.log

TODO: Check if this is caused by a yaim bug or by a misconfiguration error.

MonBox => EGEE Apel Status Report: Publishing User Name Information: No

  • Set the publishGlobalUserName attribute to "yes" in the following XML files:
    • /opt/glite/etc/glite-apel-publisher/publisher-config.xml
    • /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml
  • Run Apel Publisher on the MonBox:
RGMA_HOME=/opt/glite APEL_HOME=/opt/glite /opt/glite/bin/apel-publisher -f /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml >> /var/log/apel.log

MonBox => Not publishing accounting data (Primary Producer error)

  • When running the edg-apel-publisher cron job, the following error appears:
[root@i2gmon01 glite-apel-publisher]# /opt/glite/bin/apel-publisher -f /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml >> /var/log/apel.log

Wed Jan 23 11:36:34 UTC 2008: apel-publisher - program aborted
org.glite.apel.core.ApelException: org.glite.apel.core.ApelException: org.glite.rgma.RGMAException: Unable to locate an available Registry Service
  • First check that the R-GMA server is running correctly:
[root@i2gmon01 root]# rgma-server-check

*** Running R-GMA server tests on i2gmon01.ifca.es ***
Checking Tomcat is running on the local machine...
Successfully connected to Tomcat.
Java VM version: 1.4.2_13 (OK)
Connecting to https://rgma-server.i2g.cesga.es:8443/R-GMA/SchemaServlet...
Successfully connected to Schema.
Using PongServlet (1) on https://rgma-server.i2g.cesga.es:8443/R-GMA/PongServlet.
Using certificate /var/lib/tomcat5/conf/hostcert.pem.
Using key /var/lib/tomcat5/conf/hostkey.pem.
Checking other servlets...
Connecting to https://i2gmon01.ifca.es:8443/R-GMA/PrimaryProducerServlet:OK
Checking clock synchronization: OK
Connecting to https://i2gmon01.ifca.es:8443/R-GMA/SecondaryProducerServlet:OK
Checking clock synchronization: OK
Connecting to https://i2gmon01.ifca.es:8443/R-GMA/OnDemandProducerServlet:OK
Checking clock synchronization: OK
Connecting to https://i2gmon01.ifca.es:8443/R-GMA/ConsumerServlet:OK
Connecting to streaming port 8088 on i2gmon01.ifca.es:OK
Checking clock synchronization: OK

*** R-GMA server test successful ***

It is, so let's take a look at the client. To do this, log into an account on your User Interface and create a proxy. Then run the following command:

(On the User Interface)

[orviz@i2gui01 orviz]$ rgma-client-check 

*** Running R-GMA client tests on i2gui01.ifca.es ***

Checking C API: Failed to create producer: Unable to locate an available Registry Service
Failure - failed to insert test tuple
Checking C++ API: R-GMA application error in PrimaryProducer: Unable to locate an available Registry Service
Failure - failed to insert test tuple
Checking CommandLine API: ERROR: Unable to locate an available Registry Service
ERROR: Unable to locate an available Registry Service
Failure - failed to insert test tuple
Checking Java API: R-GMA error: Unable to locate an available Registry Service
Failure - failed to insert test tuple
Checking Python API: RGMA Error: Unable to locate an available Registry Service
Failure - failed to insert test tuple

*** R-GMA client test failed ***

This output shows that the client cannot contact the PrimaryProducer servlet, which can mean that Tomcat and its servlets are not running properly. Restart Tomcat on the server and it should work:

[root@i2gmon01 root]# /etc/init.d/tomcat5 restart

Globus MDS not running

This problem can trigger different errors at a higher level. To resolve it, kill the "slapd" process launched by globus-mds, then restart globus-mds. For example:

# pkill -9 -f slapd # WARNING: this kills every process whose command line contains "slapd"
# /etc/init.d/globus-mds start

If you then check whether the service is running, you should get something like this:

# /etc/init.d/globus-mds status
globus-mds is running with pid 28

If not, you probably need to fix something. Run the following command to find out what is wrong:

# /opt/globus/libexec/slapd -h ldap://egeemon01.ifca.es:2135 -f /opt/globus/etc/grid-info-slapd.conf -d 999 -u edginfo

and have a look at its output:

(...)
could not open config file "/opt/glue/schema/ldap/Glue-CORE.schema"

All you need to do is create this symlink:

ln -sf  /opt/glue/schema/openldap-2.0/ /opt/glue/schema/ldap

Job Submission - Several error messages

  • The following error messages in a job submission SAM test may be a result of the same issue:

Cannot read JobWrapper output, both from Condor and from Maradona or
Job got an error while in the CondorG queue or
Got a job held event, reason: Globus error 131: the user proxy expired (job is still running)

In our case it was caused by a batch system misconfiguration that did not allow certain VO groups (such as opssgm) to execute jobs. The solution was to assign an existing partition to that VO group and restart the scheduler (Maui in our case).

  • Submission to Condor failed

The RB's job controller daemon (i2g-wl-jc) could not be restarted because the CONDOR_CONFIG environment variable was not set. This was caused by a bug in a profile.d script (z_i2g-profile.sh), which failed to export the necessary environment variables.

GridIce => CE reports duplicated queues

  • Check via LDAP how many times your CE publishes a given queue:
ldapsearch -H ldap://i2gce01.ifca.es:2135 -x -b "mds-vo-name=local,o=grid" | grep "GlueCEUniqueID: i2gce01.ifca.es:2119/jobmanager-lcgpbs-imain" | wc -l

If this value is greater than 1, check which files are in the /opt/lcg/var/gip/ldif directory.

  • Go to this directory and delete all files except the ones listed below:
[root@i2gce01 ldif]# ll
-rw-r--r--    1 root     root        27182 Oct 16 13:26 static-file-CE.ldif
-rw-r--r--    1 root     root        10144 Oct 16 13:26 static-file-CESEBind.ldif
-rw-r--r--    1 root     root         5673 Oct 16 13:26 static-file-Cluster.ldif
-rw-r--r--    1 root     root          717 Oct 16 13:26 static-file-Site.ldif

LDAP combines every ldif file located in the /opt/lcg/var/gip/ldif directory, so if you have more files than the ones shown above, your CE is publishing extra data.

  • If, after deleting the extra files, some of them are created again in /opt/lcg/var/gip/ldif, check whether a process or script is creating them.

In our case the script was /lcg/var/gip/provider/i2gce01.ifca.es-cache.sh, which was creating an i2gce01.ifca.es-cache.ldif file inside the /opt/lcg/var/gip/ldif directory. This script is not supposed to run when the site BDII runs on the same node as the CE (see /opt/glite/yaim/functions/config_gip_ce_cache). Just delete it.
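
Before deleting anything, it may help to list which files in the ldif directory are not among the four static files shown above. The following is only a sketch: the function name list_extra_ldif is hypothetical, and the list of expected files is taken from our listing, so it may differ on your CE. It prints the extra files and deletes nothing.

```shell
# list_extra_ldif [DIR]
# Prints every file in DIR that is not one of the four static files
# shown in the listing above. Nothing is deleted.
# list_extra_ldif is a hypothetical helper name.
list_extra_ldif() {
    dir=${1:-/opt/lcg/var/gip/ldif}
    for f in "$dir"/*; do
        [ -e "$f" ] || continue      # empty directory: the glob did not match
        case "$(basename "$f")" in
            static-file-CE.ldif) ;;          # expected
            static-file-CESEBind.ldif) ;;    # expected
            static-file-Cluster.ldif) ;;     # expected
            static-file-Site.ldif) ;;        # expected
            *) echo "$f" ;;                  # anything else is extra data
        esac
    done
}
```

Run it without arguments on the CE and review the output by hand before removing any file.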

GridIce => MonBox EX-GRIS is not publishing

  • When checking what the EX-GRIS node is publishing via ldap no results are obtained:
[root@cabezon root]# ldapsearch -x -h i2gmon01.ifca.es -p 2136 -b mds-vo-name=local,o=grid
version: 2

#
# filter: (objectclass=*)
# requesting: ALL
#

# local, grid
dn: Mds-Vo-name=local,o=grid
objectClass: GlobusStub

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1
  • Run fmon2glue by hand to see whether it crashes with a segmentation fault:
[root@i2gmon01 root]# /opt/gridice/monitoring/bin/fmon2glue --base Mds-Vo-name=local,o=grid
Segmentation fault
  • Use the strace command to trace the fmon2glue system calls:
strace /opt/gridice/monitoring/bin/fmon2glue --base Mds-Vo-name=local,o=grid

You will probably see something like this:

stat64("/var/fmonServer/i2gce01.ifca.es/last.00010106", {st_mode=S_IFREG|0755, st_size=1944, ...}) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++

So we can assume that fmon2glue fails while parsing metric 10106 (/var/fmonServer/i2gce01.ifca.es/last.00010106) of our CE (i2gce01.ifca.es).

  • This metric stores bad information because of a wrong value in the batch system's accounting data. For every job executed, it stored an extra field, account, of this form:
account=<USER_DISTINGUISHED_NAME>
  • To avoid this situation you have two options:
    • Remove the account field from the accounting data through the JobManager.
    • Ignore metric 10106. To do so:
      • On the problematic node, remove metric 10106 from the edg-fmon-agent configuration file (/opt/edg/var/etc/edg-fmon-agent.conf) and restart the edg-fmon-agent daemon.
      • On your collector node, remove the file:
/var/fmonServer/<problematic_node_hostname>/last.00010106

Then restart the edg-fmon-server process.
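
The strace output can be long, so the offending metric file is easy to miss. As a rough sketch, the filter below prints the path of the last stat64() call seen before the SIGSEGV line. The function name last_file_before_segv is made up, and it assumes the crash directly follows a stat64() on the bad file, as it did in our trace.

```shell
# last_file_before_segv
# Reads strace output on stdin and prints the quoted path argument of the
# last stat64() call seen before the first SIGSEGV line.
# last_file_before_segv is a hypothetical helper name.
last_file_before_segv() {
    awk -F'"' '
        /^stat64\(/ { last = $2 }                      # remember the quoted path
        /SIGSEGV/   { if (last != "") print last; exit }
    '
}
```

strace writes to stderr, so a possible invocation is: strace /opt/gridice/monitoring/bin/fmon2glue --base Mds-Vo-name=local,o=grid 2>&1 | last_file_before_segv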

GridIce => Classic_SE is not publishing

  • In this case the reason was a wrong URI for LDAP queries, so the Gris tab in the GridIce portal reported a failed exit code for the last query to the GRIS.
  • To change this URI, go to the /opt/lcg/var/gip/ldif directory (on your Classic_SE node) and edit static-file-SE.ldif, replacing the GlueInformationServiceURL value with the appropriate URL. A wrong port is the most common reason for this problem.
  • Finally, restart your globus-mds daemon.

Gstat => Subcluster Information: error

  • The following error is shown in Gstat (Subcluster Information section):
GlueHostOperatingSystemName:	Scientific Linux  **ERROR** 'Scientific Linux 4.5' not found in allowed OS list below:
...
  • The parameter is published by the site BDII, but it obtains such info from the Computing Element. Check it:
[root@cabezon root]# ldapsearch -x -h egeece01.ifca.es -p 2135 -b mds-vo-name=local,o=grid | grep GlueHostOperatingSystem

objectClass: GlueHostOperatingSystem
GlueHostOperatingSystemName: Scientific Linux
GlueHostOperatingSystemRelease: 4.5
GlueHostOperatingSystemVersion: SL
  • In this particular case we should focus on the /opt/glite/etc/gip/ldif/static-file-Cluster.ldif file. Open it and search for the wrong values (in our case it was the GlueHostOperatingSystemRelease tag).
  • Finally restart globus-mds process:
/etc/init.d/globus-mds restart

sBDII => Publishing 4444 waiting jobs value

  • The Site BDII is publishing:
[root@cabezon ~]# ldapsearch -x -h i2gce01.ifca.es -p 2170 -b mds-vo-name=IFCA-I2G,o=grid | grep -i wait
GlueCEStateWaitingJobs: 4444
GlueCEStateWaitingJobs: 4444
...
  • First see the GOC Wiki entry on 4444 waiting jobs in the GRIS.
  • Check if CE is able to communicate with the batch server:
    • Via MAUI commands (try with root|edginfo|rgma):
      diagnose -g --host=torque00.ifca.es
      • If this fails, check that the failing user account (for root, check the hostname of the affected server) is listed in the ADMIN section of /var/spool/maui/maui.cfg
      • Then restart maui, killing the maui process if necessary
    • Via lcg-info-dynamic-scheduler:
      /opt/glite/etc/gip/plugin/lcg-info-dynamic-scheduler-wrapper
      • If this fails, verify that lcg-info-dynamic-scheduler.conf has the following structure (note that this configuration is only valid for interactions with PBS):
[Main]
static_ldif_file: /opt/glite/etc/gip/ldif/static-file-CE.ldif
vomap:
<queue>:<vo>
...
module_search_path : ../lrms:../ett
[LRMS]
lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-pbs
[Scheduler]
cycle_time : 0
vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h <maui_server>
  • Now check the GRIS again; the error may be solved:
    [root@cabezon ~]# ldapsearch -x -h i2gce01.ifca.es -p 2170 -b mds-vo-name=IFCA-I2G,o=grid | grep -i wait
    GlueCEStateWaitingJobs: 0
    GlueCEStateWaitingJobs: 0
    ...
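
To repeat this check from a script (for example after each change), the ldapsearch output can be piped through a small filter. This is only a sketch: the function name has_placeholder_waiting_jobs is hypothetical, and it assumes the attribute starts at the beginning of the line, as in standard LDIF output.

```shell
# has_placeholder_waiting_jobs
# Reads LDIF text on stdin. Succeeds (exit status 0) if any
# GlueCEStateWaitingJobs attribute carries the value 4444,
# fails (exit status 1) otherwise.
# has_placeholder_waiting_jobs is a hypothetical helper name.
has_placeholder_waiting_jobs() {
    grep -q '^GlueCEStateWaitingJobs: 4444[[:space:]]*$'
}
```

A possible use: ldapsearch -x -h i2gce01.ifca.es -p 2170 -b mds-vo-name=IFCA-I2G,o=grid | has_placeholder_waiting_jobs && echo "still publishing 4444"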

RB => edg_wll_JobStatus: Transport endpoint is not connected

  • When checking job status the following message appears:
[orviz@egeeui01 ~]$ edg-job-status https://egeerb01.ifca.es:9000/oOEFhxMnp0Dy2Zz7MsbH2A

**** Error: API_NATIVE_ERROR ****
Error while calling the "Status:getStatus" native api
Unable to retrieve the status for:
https://egeerb01.ifca.es:9000/oOEFhxMnp0Dy2Zz7MsbH2A
edg_wll_JobStatus: Transport endpoint is not connected

UI => voms-proxy-init: Could not establish authenticated connection with the server

  • The following error appears when trying to create a valid proxy:
[orviz@egeeui01 ~]$ voms-proxy-init --voms cms
Cannot find file or dir: /home/orviz/.glite/vomses
Enter GRID pass phrase:
Your identity: /DC=es/DC=irisgrid/O=ifca/CN=pablo-orviz
Creating temporary proxy .......................................... Done
Contacting  voms.cern.ch:15002 [/C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch] "cms"
 Failed

Error: Could not establish authenticated connection with the server.
GSS Major Status: Unexpected Gatekeeper or Service Name
GSS Minor Status Error Chain:
globus_gss_assist: Error during context initialization
globus_gsi_gssapi: Authorization denied: The name of the remote entity (/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch), and the expected name for the remote entity (/C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch) do not match
  • This error is caused by a wrong configuration of this VO on the User Interface:
    • Go to the /opt/glite/etc/vomses directory, look for the files that contain this VO name and check that the server DN/CA is correct
    • Run the voms-proxy-init command again

UI => voms-proxy-init: globus_gss_assist token :-1: read failure: unknown

  • The following error appears when trying to create a valid proxy:
[orviz@egeeui01 ~]$ voms-proxy-init --voms cms
Cannot find file or dir: /home/orviz/.glite/vomses
Enter GRID pass phrase:
Your identity: /DC=es/DC=irisgrid/O=ifca/CN=pablo-orviz
Creating temporary proxy .......................................... Done
Contacting  voms.cern.ch:15002 [/C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch] "cms"
 Failed

Error: Could not establish authenticated connection with the server.
    globus_gss_assist token :-1: read failure: unknown

None of the contacted servers for dteam were capable
of returning a valid AC for the user.
  • First of all, check that your system date is correct (a wrong date is a common error with multiple symptoms).
  • If the date is fine, see the VOMS core troubleshooting guide for more information about this issue.

UI => voms-proxy-init: globus_gsi_callback module: Invalid CRL: The available CRL has expired

  • Something is wrong with the fetch-crl cron job on your User Interface. Take a look at its log and you will probably see an error during the CRL download
  • To solve it, just run the fetch-crl cron job again:
/opt/glite/libexec/fetch-crl.sh >> /var/log/fetch-crl-cron.log

UI => globus-job-run: Globus error code 93

  • Error:
[orviz@egeeui01 orviz]$ globus-job-run egeece01.ifca.es:2119/jobmanager-lcgpbs-cms HelloWorld.jdl 

GRAM Job submission failed because the gatekeeper failed to find the requested service (error code 93)
  • Cause: wrong syntax in the globus-job-run command.

CMS: Montecarlo SAM test error (CE-cms-mc)

When this test fails with the error code:

send2nsd: NS009 - fatal configuration error: Host unknown: dpnshome.ifca.es

it's because DPNS_HOME is not correctly set. It has to point to your DPM node (in /etc/profile.d/grid-env.sh), like this:

DPNS_HOME=dpm01.ifca.es

MAUI/Torque: account table overflow error

  • In the maui.ck file you can check whether there is any inconsistency in the stored accounting data.
  • If you see any, just copy it to another location and restart the maui service.

PX => Proxy delegation problem

  • When trying to create a proxy with the myproxy-init command on the client (usually a UI):
[orviz@egeeui01 ~]$ myproxy-init -d -s egeepx01.ifca.es
[orviz@egeeui01 ~]$ myproxy-get-delegation -d -s egeepx01.ifca.es
..
Enter MyProxy pass phrase:
Failed to receive credentials.
ERROR from myproxy-server (egeepx01.ifca.es):
"<anonymous>" not authorized by server's authorized_retriever policy
..

And on the server (egeepx01.ifca.es), you get:
..
Jun 2 12:42:25 egeepx01 myproxy-server: <4001> Connection from 193.146.75.26
Jun 2 12:42:27 egeepx01 myproxy-server: <25762> Authenticated client <anonymous>
Jun 2 12:42:27 egeepx01 myproxy-server: <25762> authorization failed
Jun 2 12:42:27 egeepx01 myproxy-server: <25762> Exiting: "<anonymous>" not authorized by server's authorized_retriever policy
..

  • To deal with this situation, follow these steps:
    • Enable delegation to everyone (not very secure, though) in /opt/globus/etc/myproxy-server.config:
accepted_credentials  "*"
authorized_retrievers "*"
default_retrievers    "*"
authorized_renewers   "*"
#default_renewers      "none"
    • Copy this file to /etc/myproxy-server.config.
    • Comment out the following lines in the /etc/init.d/myproxy init script:
..
MKCONFIG="/etc/rc.d/init.d/myproxy-generate-config.pl $CERTDIR $X509_USER_CERT $EDG_LOCATION/etc/edg-myproxy.conf $CONFIG"
..
. ${GLOBUS_LOCATION}/libexec/globus-script-initializer
. ${libexecdir}/globus-sh-tools.sh
..
    • Restart the myproxy daemon.

WN => NFS/GPFS hangout

  • Due to some cron jobs, the WNs were failing at random times. The problem was caused by the updatedb program trying to access the GPFS directories. We solved it by adding "gpfs" to the $PRUNEFS variable. We also added "/home" to $PRUNEPATHS to avoid the useless indexing of the home directories.

  • Removed logwatch from the system.
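
To verify on each WN that the exclusions are really in place, something like the sketch below can be used. It is an illustration only: the function name conf_lists_word is hypothetical, and it assumes the variables are set in a file such as /etc/updatedb.conf as a double-quoted, space-separated list on a single line (for example PRUNEFS="..."). Check where and how your updatedb version reads them.

```shell
# conf_lists_word FILE VARIABLE WORD
# Succeeds if the double-quoted, space-separated value assigned to VARIABLE
# in FILE contains WORD as a whole entry. If VARIABLE is assigned more than
# once, the last assignment wins.
# conf_lists_word is a hypothetical helper name.
conf_lists_word() {
    file=$1
    var=$2
    word=$3
    # Extract the text between the quotes of: VARIABLE="..." or VARIABLE = "..."
    value=$(sed -n "s/^[[:space:]]*$var[[:space:]]*=[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$file" | tail -n 1)
    case " $value " in
        *" $word "*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

Assuming that file location, a possible check is: conf_lists_word /etc/updatedb.conf PRUNEFS gpfs || echo "gpfs is not pruned on $(hostname)"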

CE => 10 data transfer to the server failed

  • This error can be related to the expiration of VOMS certificates on the CE/RB. Check their availability in /etc/grid-security/vomses/* and that you have installed the latest version of the lcg-vomscerts package. Also, if you want to avoid installing the right VOMS certificates every time they change, you can configure your machine as described here:
 https://twiki.cern.ch/twiki//bin/view/LCG/VomsFAQforServiceManagers#How_to_get_rid_of_the_whole_host
  • We also got this message when there was an I/O error on the /home partition of the CEs

CE => GRAM Job submission failed because the provided RSL 'queue' parameter is invalid (error code 37)

  • Check if all the queues are set in:
/opt/globus/share/globus_gram_job_manager/lcgpbs.rvf


VOMS server => VO user request expires within first 15 minutes

  • Solution:

Modify voms.request.vo_membership.lifetime parameter from /var/glite/etc/voms-admin/<vo>/voms.service.properties to:

voms.request.vo_membership.lifetime = 86400

Then restart the affected VO.
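
If several VOs need the same change, the edit can be scripted. The sketch below is an assumption-laden example rather than a tested procedure: set_property is a hypothetical helper, it expects "key = value" lines as shown above, and it does not cope with values that contain "|" or "&". Keep a backup of the properties file before using it.

```shell
# set_property FILE KEY VALUE
# Replaces the line "KEY = ..." in FILE with "KEY = VALUE", or appends
# that line if KEY is not present yet.
# set_property is a hypothetical helper name.
set_property() {
    file=$1
    key=$2
    value=$3
    # Escape the dots in the key so they match literally in grep and sed.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^[[:space:]]*$pattern[[:space:]]*=" "$file"; then
        # Write through a temporary file, then copy the content back so the
        # original file keeps its owner and permissions.
        sed "s|^[[:space:]]*$pattern[[:space:]]*=.*|$key = $value|" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" &&
            rm -f "$file.tmp"
    else
        echo "$key = $value" >> "$file"
    fi
}
```

For one VO this would be called as: set_property /var/glite/etc/voms-admin/<vo>/voms.service.properties voms.request.vo_membership.lifetime 86400 (with <vo> replaced by the VO name and quoted as needed).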

VOMS server => Unable to verify signature! Server certificate possibly not installed

  • This message appears when running voms-proxy-info -all on the UI
  • It is caused by a wrong configuration of the VO on the VOMS server. Just set the VO port in the "--uri" parameter in the VO's voms.conf. For instance:
...
--uri=voms01.ifca.es:15002
...

YUM's Java missing dependency: Dependency: jdk = 2000:1.5.0_14-fcs is needed by > package java-1.5.0-sun-compat

  • See:

https://twiki.cern.ch/twiki/bin/view/EGEE/GLite31JPackage#Option_1a_Installing_JPackage_s

i2glogin listening on a private network interface

Set $GLOBUS_HOSTNAME to the proper hostname.

globus-job* stopped working

After configuring MPI support, the globus-job* commands are likely to stop working. This is because in /opt/globus/share/globus_gram_job_manager/globus-gram-job-manager.rvf the job_type attribute defaults to multiple. Change the default to single as follows:

Attribute: job_type
Description: "This specifies how the jobmanager should start the job.
              Possible values are single (even if the count > 1, only start
          1 process or thread), multiple (start count processes or threads),
          mpi (use the appropriate method (e.g. mpirun) to start a program
          compiled with a vendor-provided MPI library. Program is started
          with count nodes), and condor (starts condor jobs in the
          \"condor\" universe.)"
Values: single multiple mpi condor
Default: single
ValidWhen: GLOBUS_GRAM_JOB_SUBMIT
DefaultWhen: GLOBUS_GRAM_JOB_SUBMIT
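
To see which default is currently set without opening the file, a small awk filter is enough. This is a sketch under the assumption that each record starts with an "Attribute:" line and carries its "Default:" on a single line, as in the excerpt above. The function name rvf_default is made up.

```shell
# rvf_default FILE ATTRIBUTE
# Prints the "Default:" value of the record whose "Attribute:" line names
# ATTRIBUTE. Prints nothing if the attribute or its default is missing.
# rvf_default is a hypothetical helper name.
rvf_default() {
    awk -v attr="$2" '
        $1 == "Attribute:" { in_rec = ($2 == attr) }    # a new record starts
        in_rec && $1 == "Default:" { print $2; exit }
    ' "$1"
}
```

For example: rvf_default /opt/globus/share/globus_gram_job_manager/globus-gram-job-manager.rvf job_type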