Troubleshooting
From e-Ciencia
The following links should be useful:
- GOC Wiki.
- CNAF documentation.
- INFN CNAF Knowledge base.
- Bologna LCG Administrator's guide, with interesting problems and their solutions.
Other problems, most of them undocumented (or perhaps non-standard problems arising from our site configuration), are listed below.
(Please note that the hostnames and paths are specific to our site.)
(Please DO NOT follow these instructions blindly. You should know what you are doing.)
SSH BETWEEN WNs AND CEs
DOUBLE CHECK that the WNs and CE passwordless configuration is working!
See Passwordless SSH section in Grid Administration Guide.
Job aborted with "Status Reason: Cannot plan: BrokerHelper: Problems querying the information service"
Check that the BDII has its services running. Also check that the LDAP server on port 2135 (globus-mds) on the CE is reachable from the RB. The error message is misleading here: the RB is actually trying to contact the CE, not the BDII.
Related: Troubleshooting#Cannot_contact_to_ldap_server_on_private.2Fpublic_network
Bad UID for job execution
This error is present on the client when submitting a job:
[imainsgm@i2gce01 imainsgm]$ echo "sleep 60"|qsub -q imain -u imainsgm
qsub: Bad UID for job execution
Or in the server logs with the error 15023:
08/29/2007 18:35:57;0080;PBS_Server;Req;req_reject;Reject reply code=15023(Bad UID for job execution), aux=0, type=QueueJob, from imainsgm@10.10.0.21
The solution is to add the host (in our case 10.10.0.21) to /etc/hosts.equiv.
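The fix above can be applied idempotently, so that running it twice does not duplicate the entry. The following is only a minimal sketch: the function name add_equiv_host is hypothetical, and it assumes the file holds one host per line and ends with a newline.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque/PBS): append a host to a
# hosts.equiv-style file only if an identical line is not already there.
add_equiv_host() {
    file="$1"
    host="$2"
    # -q quiet, -x match the whole line, -F treat the host as a fixed string
    if ! grep -qxF -- "$host" "$file" 2>/dev/null; then
        printf '%s\n' "$host" >> "$file"
    fi
}

# Example (as root), using the address from the log line above:
#   add_equiv_host /etc/hosts.equiv 10.10.0.21
```

Matching the whole line avoids treating 10.10.0.2 as already present just because 10.10.0.21 is listed.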
GRAM Job submission failed because the provided RSL 'queue' parameter is invalid (error code 37)
Check /opt/globus/share/globus_gram_job_manager/pbs.rvf and /opt/globus/share/globus_gram_job_manager/lcgpbs.rvf for the correct queue values.
TODO: Check whether yaim does this correctly.
Cannot contact to ldap server on private/public network
If you are configuring your grid site with private/public network addresses and you cannot contact globus-mds on either the private or the public address, you need to apply this patch to /etc/init.d/globus-mds:
--- /etc/init.d/globus-mds 2007-08-24 15:52:04.000000000 +0200
+++ globus-mds 2007-08-24 15:52:40.000000000 +0200
@@ -178,7 +178,8 @@
if [ -z "$GLOBUS_HOSTNAME" ]; then
if [ -x ${GLOBUS_LOCATION}/bin/globus-hostname ]; then
- GLOBUS_HOSTNAME=`${GLOBUS_LOCATION}/bin/globus-hostname`
+# GLOBUS_HOSTNAME=`${GLOBUS_LOCATION}/bin/globus-hostname`
+ GLOBUS_HOSTNAME=0.0.0.0
else
echo
echo "Missing ${GLOBUS_LOCATION}/bin/globus-hostname - Is globus_common_config installed?"
MonBox => Not publishing accounting data
This happens because our MON host (egeemon01) asks the CE (egeece01) for the accounting data instead of querying the site-BDII host (egeeiis01): the MON thinks that the CE is also the site-BDII.
The problem is resolved by editing /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml and setting the GIIS to the site-BDII host:
<CPUProcessor>
<GIIS>egeeiis01.ifca.es</GIIS>
</CPUProcessor>
Another workaround is to launch the BDII on the CE, then launch the APEL parser and publisher. On the CE, execute:
# service bdii start
# env RGMA_HOME=/opt/glite APEL_HOME=/opt/glite /opt/glite/bin/apel-pbs-log-parser \
  -f /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml
# service bdii stop
Then, on the MON we are now able to publish data to the GOC:
# env RGMA_HOME=/opt/glite APEL_HOME=/opt/glite /opt/glite/bin/apel-publisher \
  -f /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml >> /var/log/apel.log
TODO: Check if this is caused by a yaim bug or by a misconfiguration error.
MonBox => EGEE Apel Status Report: Publishing User Name Information: No
- Set the publishGlobalUserName attribute to "yes" in the following XML files:
  - /opt/glite/etc/glite-apel-publisher/publisher-config.xml
  - /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml
- Run Apel Publisher on the MonBox:
RGMA_HOME=/opt/glite APEL_HOME=/opt/glite /opt/glite/bin/apel-publisher -f /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml >> /var/log/apel.log
MonBox => Not publishing accounting data (Primary Producer error)
- When running the edg-apel-publisher cron job, the following error appears:
[root@i2gmon01 glite-apel-publisher]# /opt/glite/bin/apel-publisher -f /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml >> /var/log/apel.log
Wed Jan 23 11:36:34 UTC 2008: apel-publisher - program aborted
org.glite.apel.core.ApelException: org.glite.apel.core.ApelException: org.glite.rgma.RGMAException: Unable to locate an available Registry Service
- First check that the R-GMA server is running correctly:
[root@i2gmon01 root]# rgma-server-check
*** Running R-GMA server tests on i2gmon01.ifca.es ***
Checking Tomcat is running on the local machine...
Successfully connected to Tomcat.
Java VM version: 1.4.2_13 (OK)
Connecting to https://rgma-server.i2g.cesga.es:8443/R-GMA/SchemaServlet...
Successfully connected to Schema.
Using PongServlet (1) on https://rgma-server.i2g.cesga.es:8443/R-GMA/PongServlet.
Using certificate /var/lib/tomcat5/conf/hostcert.pem.
Using key /var/lib/tomcat5/conf/hostkey.pem.
Checking other servlets...
Connecting to https://i2gmon01.ifca.es:8443/R-GMA/PrimaryProducerServlet:OK
Checking clock synchronization: OK
Connecting to https://i2gmon01.ifca.es:8443/R-GMA/SecondaryProducerServlet:OK
Checking clock synchronization: OK
Connecting to https://i2gmon01.ifca.es:8443/R-GMA/OnDemandProducerServlet:OK
Checking clock synchronization: OK
Connecting to https://i2gmon01.ifca.es:8443/R-GMA/ConsumerServlet:OK
Connecting to streaming port 8088 on i2gmon01.ifca.es:OK
Checking clock synchronization: OK
*** R-GMA server test successful ***
It is, so let's take a look at the client. Log into an account on your User Interface and create a proxy. Next, run the following command:
(In the User Interface)
[orviz@i2gui01 orviz]$ rgma-client-check
*** Running R-GMA client tests on i2gui01.ifca.es ***
Checking C API: Failed to create producer: Unable to locate an available Registry Service
Failure - failed to insert test tuple
Checking C++ API: R-GMA application error in PrimaryProducer: Unable to locate an available Registry Service
Failure - failed to insert test tuple
Checking CommandLine API: ERROR: Unable to locate an available Registry Service
ERROR: Unable to locate an available Registry Service
Failure - failed to insert test tuple
Checking Java API: R-GMA error: Unable to locate an available Registry Service
Failure - failed to insert test tuple
Checking Python API: RGMA Error: Unable to locate an available Registry Service
Failure - failed to insert test tuple
*** R-GMA client test failed ***
This output shows that the client cannot contact the PrimaryProducer servlet, which can mean that Tomcat and its servlets are not running properly. Restart Tomcat on the server and it should work:
[root@i2gmon01 root]# /etc/init.d/tomcat5 restart
Globus MDS not running
This problem can trigger different errors at a higher level. To resolve it, kill the "slapd" process launched by globus-mds, then start globus-mds again. For example:
# pkill -9 -f slapd   # ATTENTION: this kills every process whose "COMMAND" field contains the word "slapd"
# /etc/init.d/globus-mds start
Then, if you check whether the service is running, you should get something like this:
# /etc/init.d/globus-mds status
globus-mds is running with pid 28
If not, you probably need to correct something. Run the following command to find out what is wrong:
# /opt/globus/libexec/slapd -h ldap://egeemon01.ifca.es:2135 -f /opt/globus/etc/grid-info-slapd.conf -d 999 -u edginfo
and have a look at its output:
(...) could not open config file “/opt/glue/schema/ldap/Glue-CORE.schema”
In this case, all you need to do is create this symlink:
ln -sf /opt/glue/schema/openldap-2.0/ /opt/glue/schema/ldap
Job Submission - Several error messages
- The following error messages in a job submission SAM test may be a result of the same issue:
Cannot read JobWrapper output, both from Condor and from Maradona or
Job got an error while in the CondorG queue or
Got a job held event, reason: Globus error 131: the user proxy expired (job is still running)
In our case it was due to a batch system misconfiguration that didn't allow certain VO groups (like opssgm) to execute jobs. The solution was to assign an existing partition to that VO group and restart the scheduler (Maui in our case).
- Submission to Condor failed
The RB's job controller daemon (i2g-wl-jc) could not be restarted because the CONDOR_CONFIG environment variable was not set. This was caused by a bug in a profile.d script (z_i2g-profile.sh), which failed to export the necessary environment variables.
GridIce => CE reports duplicated queues
- Check via ldap what your CE is publishing for a certain queue:
ldapsearch -H ldap://i2gce01.ifca.es:2135 -x -b "mds-vo-name=local,o=grid" | grep "GlueCEUniqueID: i2gce01.ifca.es:2119/jobmanager-lcgpbs-imain" | wc -l
If this value is greater than 1, you should check which files are in the /opt/lcg/var/gip/ldif directory.
- Go to this directory and delete all files except the following:
[root@i2gce01 ldif]# ll
-rw-r--r-- 1 root root 27182 Oct 16 13:26 static-file-CE.ldif
-rw-r--r-- 1 root root 10144 Oct 16 13:26 static-file-CESEBind.ldif
-rw-r--r-- 1 root root 5673 Oct 16 13:26 static-file-Cluster.ldif
-rw-r--r-- 1 root root 717 Oct 16 13:26 static-file-Site.ldif
ldap combines every ldif file located in the /opt/lcg/var/gip/ldif directory, so if you have more files than the ones shown above, your CE is publishing extra data.
- If you notice that, after deleting the extra data, some of those ldif files are created again in the /opt/lcg/var/gip/ldif directory, you should check whether any process or script is creating them.
In our case this script was /lcg/var/gip/provider/i2gce01.ifca.es-cache.sh, which was creating an i2gce01.ifca.es-cache.ldif file inside the /opt/lcg/var/gip/ldif directory. This script is not supposed to run when the site BDII is running on the same node as the CE (see /opt/glite/yaim/functions/config_gip_ce_cache). Just delete it.
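Before deleting anything, it may help to list which files in the ldif directory fall outside the four static files named above. The sketch below only prints candidates and removes nothing; the function name list_extra_ldif is hypothetical, and the list of expected files is taken from our site, so it may differ on yours.

```shell
#!/bin/sh
# Hypothetical helper: print every entry in a GIP ldif directory that is
# not one of the four static files expected on our CE. Nothing is deleted.
list_extra_ldif() {
    dir="$1"
    for f in "$dir"/*; do
        [ -e "$f" ] || continue      # empty directory: the glob matched nothing
        case "${f##*/}" in           # strip the directory part of the path
            static-file-CE.ldif) ;;
            static-file-CESEBind.ldif) ;;
            static-file-Cluster.ldif) ;;
            static-file-Site.ldif) ;;
            *) printf '%s\n' "$f" ;;
        esac
    done
}

# Example:
#   list_extra_ldif /opt/lcg/var/gip/ldif
```

Review the printed list by hand before removing any file, since other node types publish different ldif files.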
GridIce => MonBox EX-GRIS is not publishing
- When checking via ldap what the EX-GRIS node is publishing, no results are obtained:
[root@cabezon root]# ldapsearch -x -h i2gmon01.ifca.es -p 2136 -b mds-vo-name=local,o=grid
version: 2
#
# filter: (objectclass=*)
# requesting: ALL
#
# local, grid
dn: Mds-Vo-name=local,o=grid
objectClass: GlobusStub
# search result
search: 2
result: 0 Success
# numResponses: 2
# numEntries: 1
- Run fmon2glue to see whether it crashes with a segmentation fault:
[root@i2gmon01 root]# /opt/gridice/monitoring/bin/fmon2glue --base Mds-Vo-name=local,o=grid
Segmentation fault
- Use the strace command to trace the fmon2glue system calls:
strace /opt/gridice/monitoring/bin/fmon2glue --base Mds-Vo-name=local,o=grid
You will probably see something like this:
stat64("/var/fmonServer/i2gce01.ifca.es/last.00010106", {st_mode=S_IFREG|0755, st_size=1944, ...}) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++
So we can assume that fmon2glue is failing while parsing metric 10106 (/var/fmonServer/i2gce01.ifca.es/last.00010106) of our CE (i2gce01.ifca.es).
- This metric stores bad information because of a wrong value in the batch system's accounting data. For every job executed it stored an extra field, account, with this form:
account=<USER_DISTINGUISHED_NAME>
- To avoid this situation you have two possible options:
  - Remove the account field from the accounting data through the JobManager.
  - Ignore metric 10106. To do this:
    - On the problematic node, remove metric 10106 from the edg-fmon-agent configuration file (/opt/edg/var/etc/edg-fmon-agent.conf) and restart the edg-fmon-agent daemon.
    - Go to your collector node and remove the file:
/var/fmonServer/<problematic_node_hostname>/last.00010106
then restart the edg-fmon-server process.
GridIce => Classic_SE is not publishing
- In this case the reason was a wrong URI for ldap queries, so the Gris tab in the GridIce portal was reporting a failed exit code for the last query to the GRIS.
- To change this URI, go to the /opt/lcg/var/gip/ldif directory (on your Classic_SE node) and edit the static-file-SE.ldif file, replacing the GlueInformationServiceURL instance with the appropriate URL. A wrong port value is the most common reason for this problem.
- Finally, restart your globus-mds daemon.
Gstat => Subcluster Information: error
- The following error is shown in Gstat (Subcluster Information section):
GlueHostOperatingSystemName: Scientific Linux **ERROR** 'Scientific Linux 4.5' not found in allowed OS list below: ...
- The parameter is published by the site BDII, but it obtains such info from the Computing Element. Check it:
[root@cabezon root]# ldapsearch -x -h egeece01.ifca.es -p 2135 -b mds-vo-name=local,o=grid | grep GlueHostOperatingSystem
objectClass: GlueHostOperatingSystem
GlueHostOperatingSystemName: Scientific Linux
GlueHostOperatingSystemRelease: 4.5
GlueHostOperatingSystemVersion: SL
- In this particular case we should focus on the /opt/glite/etc/gip/ldif/static-file-Cluster.ldif file. Open it and search for the wrong values (in our case it was the GlueHostOperatingSystemRelease tag).
- Finally, restart the globus-mds process:
/etc/init.d/globus-mds restart
sBDII => Publishing 4444 waiting jobs value
- The Site BDII is publishing:
[root@cabezon ~]# ldapsearch -x -h i2gce01.ifca.es -p 2170 -b mds-vo-name=IFCA-I2G,o=grid | grep -i wait
GlueCEStateWaitingJobs: 4444
GlueCEStateWaitingJobs: 4444
...
- First see the GOC Wiki entry "4444 Waiting jobs in the GRIS".
- Check whether the CE is able to communicate with the batch server:
  - Via MAUI commands (try with root|edginfo|rgma): diagnose -g --host=torque00.ifca.es
    - If this fails, check that the buggy user account name (for the root case, check the affected hostname's server) is in the ADMIN section of /var/spool/maui/maui.cfg
    - Then restart maui, killing the maui process if necessary
  - Via lcg-info-dynamic-scheduler: /opt/glite/etc/gip/plugin/lcg-info-dynamic-scheduler-wrapper
    - If this is the case, verify that lcg-info-dynamic-scheduler.conf has the following structure (note that this configuration is only valid for interactions with PBS):
[Main]
static_ldif_file: /opt/glite/etc/gip/ldif/static-file-CE.ldif
vomap:
<queue>:<vo>
...
module_search_path : ../lrms:../ett
[LRMS]
lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-pbs
[Scheduler]
cycle_time : 0
vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h <maui_server>
- Now check the GRIS again; the error may be solved:
[root@cabezon ~]# ldapsearch -x -h i2gce01.ifca.es -p 2170 -b mds-vo-name=IFCA-I2G,o=grid | grep -i wait
GlueCEStateWaitingJobs: 0
GlueCEStateWaitingJobs: 0
...
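The before/after check in this section can be reduced to a single number. The sketch below counts how many GlueCEStateWaitingJobs attributes still carry the 4444 placeholder in ldapsearch output read from standard input; the function name count_4444 is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read ldapsearch output on stdin and print how many
# GlueCEStateWaitingJobs attributes are still set to the 4444 placeholder.
count_4444() {
    # grep -c prints the count but exits non-zero when it is 0,
    # so "|| true" keeps the function's exit status at success.
    grep -c '^GlueCEStateWaitingJobs: 4444[[:space:]]*$' || true
}

# Example, reusing the query shown above:
#   ldapsearch -x -h i2gce01.ifca.es -p 2170 -b mds-vo-name=IFCA-I2G,o=grid | count_4444
```

A result of 0 means no queue is publishing the placeholder any more.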
RB => edg_wll_JobStatus: Transport endpoint is not connected
- When checking job status the following message appears:
[orviz@egeeui01 ~]$ edg-job-status https://egeerb01.ifca.es:9000/oOEFhxMnp0Dy2Zz7MsbH2A
**** Error: API_NATIVE_ERROR ****
Error while calling the "Status:getStatus" native api
Unable to retrieve the status for: https://egeerb01.ifca.es:9000/oOEFhxMnp0Dy2Zz7MsbH2A
edg_wll_JobStatus: Transport endpoint is not connected
UI => voms-proxy-init: Could not establish authenticated connection with the server
- The following error appears when trying to create a valid proxy:
[orviz@egeeui01 ~]$ voms-proxy-init --voms cms
Cannot find file or dir: /home/orviz/.glite/vomses
Enter GRID pass phrase:
Your identity: /DC=es/DC=irisgrid/O=ifca/CN=pablo-orviz
Creating temporary proxy .......................................... Done
Contacting voms.cern.ch:15002 [/C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch] "cms"
Failed
Error: Could not establish authenticated connection with the server.
GSS Major Status: Unexpected Gatekeeper or Service Name
GSS Minor Status Error Chain:
globus_gss_assist: Error during context initialization
globus_gsi_gssapi: Authorization denied: The name of the remote entity (/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch), and the expected name for the remote entity (/C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch) do not match
- This error is caused by a misconfiguration of this VO on the User Interface:
  - Go to the /opt/glite/etc/vomses directory, look for the files that contain this VO name and check that the DN/CA is correct
  - Try the voms-proxy-init command again
UI => voms-proxy-init: globus_gss_assist token :-1: read failure: unknown
- The following error appears when trying to create a valid proxy:
[orviz@egeeui01 ~]$ voms-proxy-init --voms cms
Cannot find file or dir: /home/orviz/.glite/vomses
Enter GRID pass phrase:
Your identity: /DC=es/DC=irisgrid/O=ifca/CN=pablo-orviz
Creating temporary proxy .......................................... Done
Contacting voms.cern.ch:15002 [/C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch] "cms"
Failed
Error: Could not establish authenticated connection with the server.
globus_gss_assist token :-1: read failure: unknown
None of the contacted servers for dteam were capable
of returning a valid AC for the user.
- First of all, check that your system date is correct (a wrong date is a common error with multiple symptoms).
- If the date is not the problem, see VOMS core troubleshooting for more information about this issue.
UI => voms-proxy-init: globus_gsi_callback module: Invalid CRL: The available CRL has expired
- Something is wrong with the fetch-crl cron process on your User Interface. Take a look at this cron process's log (fetch-crl) and you will probably see an error during the CRL download.
- To solve it, just run the fetch-crl cron process again:
/opt/glite/libexec/fetch-crl.sh >> /var/log/fetch-crl-cron.log
UI => globus-job-run: Globus error code 93
- Error:
[orviz@egeeui01 orviz]$ globus-job-run egeece01.ifca.es:2119/jobmanager-lcgpbs-cms HelloWorld.jdl
GRAM Job submission failed because the gatekeeper failed to find the requested service (error code 93)
- The cause is bad syntax in the globus-job-run command.
CMS: Montecarlo SAM test error (CE-cms-mc)
When this test fails with the error code:
send2nsd: NS009 - fatal configuration error: Host unknown: dpnshome.ifca.es
it's because DPNS_HOME is not correctly set. It has to point to your DPM node (in /etc/profile.d/grid-env.sh), like this:
DPNS_HOME=dpm01.ifca.es
MAUI/Torque: account table overflow error
- In the maui.ck file you can check whether there is any inconsistency in the stored accounting data.
- If you see any, just copy the file to another location and restart the maui service.
PX => Proxy delegation problem
- When trying to create a proxy with the myproxy-init command on the client (usually a UI):
[orviz@egeeui01 ~]$ myproxy-init -d -s egeepx01.ifca.es
[orviz@egeeui01 ~]$ myproxy-get-delegation -d -s egeepx01.ifca.es
..
Enter MyProxy pass phrase:
Failed to receive credentials.
ERROR from myproxy-server (egeepx01.ifca.es): "<anonymous>" not authorized by server's authorized_retriever policy
..
And on the server (egeepx01.ifca.es), you get:
..
Jun 2 12:42:25 egeepx01 myproxy-server: <4001> Connection from 193.146.75.26
Jun 2 12:42:27 egeepx01 myproxy-server: <25762> Authenticated client <anonymous>
Jun 2 12:42:27 egeepx01 myproxy-server: <25762> authorization failed
Jun 2 12:42:27 egeepx01 myproxy-server: <25762> Exiting: "<anonymous>" not authorized by server's authorized_retriever policy
..
- To deal with this situation, follow these steps:
  - Enable delegation to everyone (not very secure, though) in /opt/globus/etc/myproxy-server.config:
accepted_credentials "*"
authorized_retrievers "*"
default_retrievers "*"
authorized_renewers "*"
#default_renewers "none"
  - Copy this file to /etc/myproxy-server.config.
  - Comment out the following lines in the /etc/init.d/myproxy daemon file:
..
MKCONFIG="/etc/rc.d/init.d/myproxy-generate-config.pl $CERTDIR $X509_USER_CERT $EDG_LOCATION/etc/edg-myproxy.conf $CONFIG"
..
. ${GLOBUS_LOCATION}/libexec/globus-script-initializer
. ${libexecdir}/globus-sh-tools.sh
..
  - Restart the myproxy daemon.
WN => NFS/GPFS hangout
- Due to some cron jobs, the WNs were failing at random times. The problem was caused by the updatedb program trying to access the GPFS directories. We solved it by adding "gpfs" to the $PRUNEFS variable. We also added "/home" to $PRUNEPATHS, to avoid needlessly indexing the home directories.
- Removed logwatch from the system.
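The updatedb change above can be scripted so that it is safe to run repeatedly on every WN. This is a sketch under stated assumptions: it assumes the variables live in a file such as /etc/updatedb.conf in the form VAR="word word ..." on a single line (the source does not name the file), and the function name add_prune_word is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append a word to a VAR="a b c" style list in a
# config file, unless the word is already in the list.
# Assumption: the variable is defined on one line with double quotes.
add_prune_word() {
    file="$1"; var="$2"; word="$3"
    # Already present as a whole word inside the quoted list? Then do nothing.
    if grep -Eq "^${var}=\"([^\"]* )?${word}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    tmp="${file}.tmp.$$"
    # Insert " word" just before the closing quote of the list.
    sed "s|^\(${var}=\"[^\"]*\)\"|\1 ${word}\"|" "$file" > "$tmp" || return 1
    # Write back through the original file to keep its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (as root; the path is an assumption):
#   add_prune_word /etc/updatedb.conf PRUNEFS gpfs
#   add_prune_word /etc/updatedb.conf PRUNEPATHS /home
```

The sed expression uses "|" as its delimiter so that path values such as /home need no escaping.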
CE => 10 data transfer to the server failed
- This error can be related to expired VOMS certificates on the CE/RB. Check their validity in /etc/grid-security/vomses/* and make sure that you have the latest version of the lcg-vomscerts package installed. Also, if you want to avoid having to install the right VOMS certificates every time they change, you can configure your machine as described here:
https://twiki.cern.ch/twiki//bin/view/LCG/VomsFAQforServiceManagers#How_to_get_rid_of_the_whole_host
- We also got this message when there was an I/O error on the /home partition of the CEs.
CE => GRAM Job submission failed because the provided RSL 'queue' parameter is invalid (error code 37)
- Check if all the queues are set in:
/opt/globus/share/globus_gram_job_manager/lcgpbs.rvf
VOMS server => VO user request expires within first 15 minutes
- See GGUS ticket #37328: https://gus.fzk.de/ws/ticket_info.php?ticket=37328&from=search
- Solution:
Modify voms.request.vo_membership.lifetime parameter from /var/glite/etc/voms-admin/<vo>/voms.service.properties to:
voms.request.vo_membership.lifetime = 86400
Then restart affected VO
VOMS server => Unable to verify signature! Server certificate possibly not installed
- See Savannah bug: https://savannah.cern.ch/bugs/?36052
- This message appears when running voms-proxy-info -all on the UI
- It is caused by a wrong configuration for the VO in the VOMS server. Just set the VO port in the "--uri" parameter in the VO's voms.conf. For instance:
... --uri=voms01.ifca.es:15002 ...
YUM's Java missing dependency: Dependency: jdk = 2000:1.5.0_14-fcs is needed by > package java-1.5.0-sun-compat
- See:
https://twiki.cern.ch/twiki/bin/view/EGEE/GLite31JPackage#Option_1a_Installing_JPackage_s
i2glogin listening on a private network interface
Set $GLOBUS_HOSTNAME to the proper hostname.
globus-job* stopped working
After configuring MPI support, the globus-job* commands are likely to stop working. This is because in /opt/globus/share/globus_gram_job_manager/globus-gram-job-manager.rvf the attribute job_type is set to multiple by default. Change it to single as follows:
Attribute: job_type
Description: "This specifies how the jobmanager should start the job.
Possible values are single (even if the count > 1, only start
1 process or thread), multiple (start count processes or threads),
mpi (use the appropriate method (e.g. mpirun) to start a program
compiled with a vendor-provided MPI library. Program is started
with count nodes), and condor (starts condor jobs in the
\"condor\" universe.)"
Values: single multiple mpi condor
Default: single
ValidWhen: GLOBUS_GRAM_JOB_SUBMIT
DefaultWhen: GLOBUS_GRAM_JOB_SUBMIT
