Windows Cluster Infrastructure
From e-Ciencia
About
This page gives basic information about Microsoft Compute Cluster Server 2003 (CCS 2003) at IFCA as part of the Interactive European Grid Project[1]. Microsoft CCS 2003 is a high-performance cluster solution. Its operating system is a special version of Windows Server 2003, namely Windows Server 2003 Compute Cluster Edition. A cluster consists of exactly one Head Node (HN) and usually several Compute Nodes (CN). The Head Node, which can be compared to the Computing Element of a grid site, holds the scheduler and is responsible for distributing the jobs and organizing the Compute Nodes. The number of Compute Nodes is not limited, and switches can be used to place them in different subnets.
This cluster solution can run both serial and parallel applications. Microsoft CCS 2003 uses MS-MPI, which is based on the open source library MPICH2 and extends it with secure user authentication against Active Directory (through Kerberos). CCS 2003 runs on 64-bit x86-based computers, but 32-bit applications can also be executed.
Any feedback is greatly appreciated (email: fuchs [A_T] ifca.unican.es).
Cluster Configuration
The cluster configuration used here is shown in the picture below. If necessary, more Virtual Machines can be added and used as Compute Nodes (CN3, for example, is a Virtual Machine).
Product License
*** license was removed after making the website accessible to everyone *** You can get a test license on the Microsoft website; see [2]
The license covers both Head and Compute Nodes and is valid for 180 days, beginning on 9 October 2006. Edit, April 2007: the servers are being reinstalled because the license expired. The new trial license will expire in October.
The list with the Windows login passwords has been given to Jesús Marco de Lucas.
Node Info
| Hostname | Function | Model | Processor | Freq. | RAM Memory | Hard Disk | NIC1 MAC | NIC2 MAC |
| WINHPCSERV | Head Node | HP ML 110 | Pentium 4 | 3.0 GHz | 512 MB | 75 GB | - | - |
| WINHPC01 | Comp. Node | HP ML 110 | Pentium 4 | 3.0 GHz | 512 MB | 75 GB | - | - |
| WINHPC02 | Comp. Node | HP ML 110 | Pentium 4 | 3.0 GHz | 512 MB | 75 GB | - | - |
| WINHPC03 | Comp. Node(VM) | (HP ML 110) | Pentium 4 DualCore | (1.83 GHz) | (256 MB) | (10 GB) | (virtual) | none |
| WINHPC04 | SL 4.4 VM Server | HP ML 110 | Pentium 4 DualCore | 1.83 GHz | 512 MB | 160 GB | - | none |
Installation
Standard Installation - important steps
Hardware requirements can be found here: Microsoft Technet sysreqs. Another useful document about how to install can be found here: Microsoft Technet
- Installation of Windows Server 2003 Compute Cluster Edition
- optional: automatic installation can be done with WDS (Windows Deployment Services), the successor of Remote Installation Services (RIS). The following chapters mostly mention RIS because at that time only RIS was available, not WDS. However, all the features RIS provides are also available with WDS.
- Installation of drivers (network, chipset, graphics driver)
- Configuration of System (Firewall, Automatic Updates, etc.)
- Installation of standalone DNS or integration in existing DNS environment (Requirement for Active Directory)
- Installation of Active Directory (Requirement for Compute Cluster Pack)
- Installation of Compute Cluster Pack (to configure Head Node)
- Configuration of Cluster (Choose network topology, setting up RIS, Approve Compute Nodes etc.)
- The network topology chosen is the one where Compute Nodes are on both the Private and the Public Network, so that access is easier and a separate network is available just for computing.
- When using RIS you need to configure DHCP:
- The Head Node got 192.168.1.1
- Compute Nodes got 192.168.1.x, with x beginning from 2
- It is STRONGLY recommended to assign IP addresses only to the RIS nodes that are meant to be installed (MAC reservation). Otherwise other machines that have PXE boot activated might be installed as well (e.g. Linux machines from other projects).
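The addressing scheme and the MAC reservation rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not the DHCP configuration actually used at IFCA: the function names and the MAC addresses are hypothetical, and a real setup would enter the resulting reservations in the DHCP server on the Head Node.

```python
import ipaddress

# Private cluster network as described above; the head node holds .1
PRIVATE_NET = ipaddress.ip_network("192.168.1.0/24")
HEAD_NODE_IP = PRIVATE_NET.network_address + 1


def plan_reservations(compute_macs):
    """Map each known compute-node MAC to 192.168.1.x, with x starting at 2."""
    plan = {}
    for offset, mac in enumerate(compute_macs, start=2):
        ip = PRIVATE_NET.network_address + offset
        if ip not in PRIVATE_NET or ip == PRIVATE_NET.broadcast_address:
            raise ValueError("more nodes than addresses in the private subnet")
        plan[mac.lower()] = str(ip)
    return plan


def lease_for(mac, plan):
    """Return the reserved address, or None so unknown PXE clients get no lease."""
    return plan.get(mac.lower())


# Hypothetical MAC addresses, for illustration only
reservations = plan_reservations(["00:11:22:33:44:01", "00:11:22:33:44:02"])
```

A machine whose MAC address is not in the plan receives no address and therefore cannot be reinstalled by accident through PXE.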
More detailed information can be found on the net, e.g. starting from here Deploying and Managing Microsoft Windows Compute Cluster Server 2003
Hyperthreading:
In most cases it seems better to deactivate Hyperthreading (see windowshpc.net). But according to www.linuxclustersinstitute.org it depends a lot on your application, so you need to check this yourself.
Installation problems:
- Problem with the installation of Windows Server 2003 SP2: it cannot be installed on the first evaluation version of MS Compute Cluster Server 2003 (see: http://windowshpc.net/forums/thread/1114.aspx). A possible solution, as indicated on that website, is to install a newer version of Windows CCS 2003 that already comes with the service pack pre-installed.
- Issues with RIS Installation
- At the end of RIS setup you are asked whether you want to enable Internet Connection Sharing (ICS) or provide the services (DHCP, DNS, Gateway) yourself. Here at IFCA keeping ICS disabled is definitely the better idea, because in case of errors much more information about the problem can be retrieved.
- At first it was not possible to boot from the onboard network card. The solution was to add the most current driver for the Broadcom network card to the RIS image. The driver files (in this case b57amd64.PNF) have to be copied to the amd64 folder. In addition, the following folder has to be created: E:\RemoteInstall\Setup\English\Windows\$OEM$\$1\driver\Broadcom. This folder must contain the 3 driver files (b57amd64.cat, b57amd64.inf and presumably b57amd64.sys). RIS uses this folder to copy the drivers to the hard disk so that they can be used in Windows mode. For further information see How to Add Third-Party OEM Network Adapters to RIS Installations
Dual Installation Windows CCS 2003 and Linux
Another good way to improve efficiency is to set up the hardware so that it can run either the Windows or the Linux cluster configuration. A very nice document about this can be found here: Dual Install - Univ. of Nebraska
Using XEN to run Windows CCS Compute Nodes in a Virtual Machine
A document with more information about this can be found in the attachments: [3]
Administration of Windows CCS 2003
Access to the nodes
The best way to connect remotely to the cluster is through RDP (command in Windows: mstsc). If you encounter problems accessing a node, remember to check whether RDP is enabled[4].
File transfer
An easy way of exchanging files with the cluster is to connect through RDP and configure your local drives to be mapped. This is known to work under Windows, but rdesktop under Linux doesn't seem to have this feature. Furthermore, it is also possible to install an SSH server[5]. Note: the newest version didn't work, but version 1.0.10 was no problem.
Administrate Head Nodes and Compute Nodes
Day-to-day administration is done via the Compute Cluster Administrator MMC (Microsoft Management Console) snap-in. A shortcut to the console is created automatically in the Start menu. Information about jobs can be retrieved with the Compute Cluster Job Manager. Just like the Compute Cluster Administrator, it is part of the Compute Cluster Pack. It can be started via “Start - Programs - Microsoft Compute Cluster Pack - Compute Cluster Job Manager”.
Logfiles
- ManagementService.log, C:\Program Files\Microsoft Compute Cluster Pack\LogFiles. Exists on all CNs; useful in diagnosing compute node discovery and configuration problems, such as the HN not being found or a node staying in the Configuring state.
- TodoList.log, C:\Program Files\Microsoft Compute Cluster Pack\LogFiles. Exists only on the HN; useful in diagnosing ToDoList discovery and configuration problems.
- Binlsvc.log, C:\WINDOWS\Debug. Exists only on the HN (if RIS is installed); useful in diagnosing RIS-related problems.
- Log files of the web server / web interface: C:\Windows\system32\Logfiles\W3SVC1.
- Log file of the firewall: C:\windows\pfirewall.log
As always, the Event Log is a good way to get information about the system.
A possible way to collect log files automatically from a remote machine can be found on this page. The tool mentioned there should be run on the cluster.
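As a rough illustration of gathering these logs in one place, the following Python sketch copies whichever of the single-file logs listed above exist on a node into a destination folder. It is not the tool referred to above; the function name and the destination folder are hypothetical, and the W3SVC1 directory is left out because it holds many files rather than one.

```python
import shutil
from pathlib import Path

# Single-file logs named in the list above
LOG_FILES = [
    r"C:\Program Files\Microsoft Compute Cluster Pack\LogFiles\ManagementService.log",
    r"C:\Program Files\Microsoft Compute Cluster Pack\LogFiles\TodoList.log",
    r"C:\WINDOWS\Debug\Binlsvc.log",
    r"C:\windows\pfirewall.log",
]


def collect_logs(destination, log_files=LOG_FILES):
    """Copy every existing log file into destination; return the names copied."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in log_files:
        source = Path(name)
        # Not every log exists on every node (e.g. TodoList.log only on the HN)
        if source.is_file():
            shutil.copy2(source, destination / source.name)
            copied.append(source.name)
    return copied
```

Run on a node, `collect_logs(r"D:\collected-logs")` (a hypothetical target folder) would return the names of the logs that were actually present there.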
Essential System Services
In case you encounter problems make sure that these services run correctly:
- MSSQL$COMPUTECLUSTER
- CCP Scheduler service (CcpScheduler.exe)
- CCP SDM Store service (CcpSdm.exe)
- CCP Management service (CcpConfiguration.exe)
- CCP Node Manager service (CcpNodeManager.exe)
- CCP MPI service (SMPD.exe)
See also [6]
Security
Automatic Updates
New security vulnerabilities become known every day, so it is necessary to install new hotfixes as soon as possible. In this cluster 'Automatic Updates' is configured to download and install updates automatically; afterwards the system reboots automatically if necessary.
Firewall
The nodes have public IP addresses, so it is strongly recommended to install firewall software. Here, the Windows Firewall was activated on the public interface. Although it only checks incoming traffic, it improves the security of the cluster. Several ports have to be opened --> see the domain group policy for details.
Group Policies
Group policies can and should be used to centrally manage and configure certain settings, e.g. Firewall, Automatic Updates and Logon settings. A domain group policy makes sure that the web services are only accessible from the intranet. The other network interface, which is connected to the internet so that the latest updates can always be retrieved, is protected by the firewall.
Backup
It is possible to use Windows NT Backup to back up the head node and compute nodes. Third-party tools (that are certified to run on Windows Server 2003) will work as well. Until now (Nov. 06) there is no high-availability failover for the Head Node, although Microsoft says that future versions might provide a solution. In production mode a backup of the head node that can be restored quickly is therefore very important (at least a Windows backup, or better a disk copy, is recommended).
Using Microsoft Compute Cluster Pack Tool Pack
This tool pack has to be downloaded separately and includes these tools:
- Simple Cluster Monitor, which provides a visual representation of your cluster utilization
- MPI Ping Pong (mpipingpong.exe), which can be used to verify that all the nodes, and the network, in your cluster are functional
- A PowerShell snap-in that provides many Compute Cluster Pack features. Note: you need to install PowerShell to use this snap-in
Usage of Compute Cluster
Install Compute Cluster Pack on your Windows Workstation
The standard way of usage is that a user has a computer with Windows and the Windows Compute Cluster Pack installed. The user's PC also has to be a member of the same Active Directory domain as the Windows CCS 2003 cluster. The user starts the Job Scheduler console and configures the job there. After submitting it, the user has to wait until the job is finished and can then access the output folder through the UNC path of the work directory that was provided when specifying the job. Apart from that, the user has to be a member of either the Cluster Users or the Cluster Administrators group. These are defined in Cluster Administrator.
How to submit jobs to Windows CCS 2003 Cluster
There are several ways to submit jobs. The usual ones are:
- Using the Job Scheduler, which supplies a graphical user interface. The job scheduler comes with the installation of the compute cluster pack
- Using the console, e.g.: "Job submit /jobname:JOBNAME /workdir:\\winhpcserv\batchpi /stdout:batchpOut.txt /stderr:batchpiErr.txt /numprocessors:3 mpiexec batchpi.exe 100000" (Note: the machine knows which HN to contact because the Compute Cluster Pack was installed)
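When jobs are submitted from a script rather than typed by hand, the console command above can be assembled programmatically. The following Python sketch only builds the argument list shown in the example and, optionally, runs it; the helper names are hypothetical, and the switches are limited to the ones that appear in the example above.

```python
import subprocess


def build_job_submit(jobname, workdir, stdout, stderr, numprocessors, command):
    """Assemble the 'job submit' command line from the console example above."""
    return [
        "job", "submit",
        f"/jobname:{jobname}",
        f"/workdir:{workdir}",
        f"/stdout:{stdout}",
        f"/stderr:{stderr}",
        f"/numprocessors:{numprocessors}",
        *command,
    ]


def submit(argv):
    """Run the command; only works where the Compute Cluster Pack is installed."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


# Same values as in the console example
argv = build_job_submit(
    jobname="JOBNAME",
    workdir=r"\\winhpcserv\batchpi",
    stdout="batchpOut.txt",
    stderr="batchpiErr.txt",
    numprocessors=3,
    command=["mpiexec", "batchpi.exe", "100000"],
)
```

Passing the arguments as a list avoids quoting problems with the UNC path of the work directory.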
Integrating Windows Cluster into int.eu.grid
Important Issues
Overall this prototype has to accomplish these functions:
- submit jobs
- get job status
- forward all the necessary information to the Windows Cluster and vice versa
- accept and forward input data
- get Windows results back to CE and to User on User Interface
- first idea: pack all files in output folder to zip; name will contain jobid, put it into FTP folder and return ftp link to file
- correct handling of job IDs (until now the solution is a map file in which the Linux job ID and the Windows job ID are associated)
- user rights / authentication
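The "first idea" for returning results (zip the output folder, put the job ID into the file name, place the archive in the FTP folder and hand back a link) could look roughly like the following Python sketch. The function name, the archive naming pattern and the FTP base URL are assumptions made for illustration; the prototype itself did not fix these details.

```python
import zipfile
from pathlib import Path


def pack_results(output_dir, ftp_root, job_id, ftp_base_url):
    """Zip all files below output_dir into ftp_root and return an FTP link.

    Assumes ftp_root is not located inside output_dir.
    """
    output_dir = Path(output_dir)
    archive = Path(ftp_root) / f"job_{job_id}.zip"  # hypothetical naming pattern
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(output_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the output folder
                zf.write(path, path.relative_to(output_dir).as_posix())
    return f"{ftp_base_url.rstrip('/')}/{archive.name}"
```

A call such as `pack_results(workdir, ftp_folder, 42, "ftp://winhpc01.ifca.es/results")` would return `ftp://winhpc01.ifca.es/results/job_42.zip`; the `results` path is made up for the example.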
Using a Web Interface to submit jobs / Collaboration with Cornell University
At first a simple Web Interface was created using Microsoft Visual Studio 2005 (Trial Version) and the provided, easy-to-use API (ccpapi). The user simply opens the website, enters the job parameters or uploads input files, and submits the job. The web interface uses the ccpapi to forward all parameters to the cluster and to queue the job. The web server is usually Microsoft Internet Information Server and runs on the Head Node. Cornell University, for example, has already built such a solution: [7]. After exchanging experiences they also added DYRESM [8].
Finally it was decided that this solution should not be continued because there are more promising approaches.
Integrating Microsoft Compute Cluster Server into gLite
This idea was realized by people of the Academia Sinica in Taiwan with gLite version 1.4. There were still open issues, but due to a lack of human resources they had to suspend the project. Basically it uses a modified Computing Element which recognizes Windows jobs and forwards them to the Windows Cluster instead of to the usual cluster. It works like this:
- the user sends a job in the usual way, using the User Interface. The job, which is a Windows job, will later be executed on the Windows cluster
- the job parameters are sent to the WMS. There, a job wrapper prepares the job to be sent to the Computing Element.
- on the Computing Element, the job is recognized as a Windows job (possibly through a flag) and is therefore sent to the Windows cluster
- pbs_submit.sh must be modified to execute another script
- this other Perl script will send the job parameters to the Web Interface which runs on the Head Node
- file upload to the cluster can possibly be done through an FTP or SSH server running on the Head Node
- sending the job to the Windows Cluster is done through a Web application, similar to the one that Cornell University uses
- at first an attempt will be made to omit the Resource Broker, which means jobs will be sent directly from the User Interface to the CE by using RSL (Resource Specification Language). When this is achieved, the more complicated things can be done, such as using an existing RB.
All information received from Sinica can be found here: Media:sinica-info.tar.gz
Overview of the prototype developed at IFCA: Windows jobs follow the red line
Creating the Web Interface for GRID integration
Requirements:
- Microsoft Internet Information Server
- .net Framework (64bit)
- Compute Cluster Server Libraries (available, if Compute Cluster Pack is installed?)
The following steps are necessary to get this web interface to work: (SERVER)
- Installation of IIS on HN
- IIS must be associated with the installed .NET Framework: execute the command aspnet_regiis.exe -I -enable in the folder of the respective .NET Framework. For example this could be: C:\windows\Microsoft.Net\Framework64\v2.0.50727\aspnet_regiis.exe -I -enable
- Creating a web service with this tutorial: codeproject
- Adding functions to access Cluster: MS Cluster API Access
- Copying the files of the web interface into the main folder of the Website
- Installation of the .NET Framework (64-bit) and subsequent registration of the framework with the operating system (C:\windows\Microsoft.Net\Framework64\v2.0.50727\aspnet_regiis.exe -I -enable)
- Activation of 'Active Server Pages' and 'ASP.NET v2.0.50727', depending on your version of ASP.NET (changes made in IIS console - Web Service Extension)
- Then IIS should be restarted (changes made in IIS console)
- Security issues, e.g. only allow access to this website from specific IP ranges and later use SSL.
Note that the Web Interface uses a dedicated user (hard coded). This may not be very secure, but right now the only other possibility would be to forward the credentials of the submitting user. This, however, is very complicated, if it is possible at all.
- last version of Web Interface (prototype): Media:Webservice2.rar
Preparing CLIENT for Web Application
The client part of this application will use perl modules to execute the functions on the Web Interface running on the Head Node --> Media:ccs_scripts.tar.gz
Setting up FTP Service on Windows machine to make upload of workfiles possible
- Installing FTP Service through 'Add software / Windows Components / Application Server / IIS / FTP Service'
- FTP Service can be managed through the console of IIS
- installing module on CE so that perl can upload files easier (see Net::FTP::File)
- Setting up user rights:
- at first this is realized with a standard user that has access to the FTP folder. The problem was that initially winhpcserv.ifca.es served both as Active Directory Domain Controller and as FTP server. Because unauthorized access to a DC is a bad idea, the FTP role was transferred to winhpc01.ifca.es.
- adding security by allowing only hosts from specific subnets to access the FTP site (configured in IIS settings) --> only the following subnets/hosts are allowed: 192.168.1.0/24, 193.146.75.0/24, 193.144.209.0/24, 10.0.0.110
- to be done: adding further security, e.g. using secure FTP, ...
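The effect of the IP restriction above can be checked with a short Python sketch that tests whether a client address falls into one of the allowed ranges. This only mirrors the rule for illustration; the real restriction is configured in the IIS settings, not in code. The third subnet is assumed here to be 193.144.209.0/24, and the single host 10.0.0.110 is written as a /32 network.

```python
import ipaddress

# Allowed subnets/hosts from the list above (third entry assumed to be a /24)
ALLOWED = [
    ipaddress.ip_network(net)
    for net in (
        "192.168.1.0/24",
        "193.146.75.0/24",
        "193.144.209.0/24",
        "10.0.0.110/32",
    )
]


def is_allowed(client_ip):
    """Return True if client_ip lies in one of the allowed subnets/hosts."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED)
```

For example, `is_allowed("192.168.1.17")` is true for a node on the private cluster network, while `is_allowed("10.0.0.111")` is false because only the single host 10.0.0.110 is listed.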
On the Computing Element you have to prepare the following steps
- to allow perl scripts to talk to the Web service you have to install the package perl-SOAP-lite
Set up dedicated Computing Element for testing purposes
Installation of Computing Element for Windows CCS Integration
- Installation of Scientific Linux 3.08
- Installation of j2re-1.4.2_13
- stopping autoupdate (updates should be done manually to make sure nothing gets overwritten)
- Installation and configuration of ntp
- Installation of YAIM 3.0.1_15
- Preparing Site configuration file site-info.def (Media:Example-site-info.def)
- Copying the scripts which forward the job to the Windows Web interface instead of the PBS queuing system
- configuring iptables (based on this script: Media:firewall-iptables)
- for this, a ready-made script from Iban Cabrillo was used and adapted. This shell script has to be executed, and after that you should run the following commands: iptables-save, iptables-save > /etc/sysconfig/iptables and service iptables restart.
Configuring Computing Element to accept and forward Windows jobs
- first tests submit jobs directly to the dedicated CE (to use a test environment without needing to set up, for example, RB, BDII, LRMS...)
- the job will not reach the standard scheduler (e.g. PBS). Instead the CE contacts the Windows Web Interface and submits all necessary information. On the Windows side the scheduler of Compute Cluster Server schedules the job to the existing nodes.
- finally, submitting a simple job to the test CE succeeded: command --> globus-job-submit wince01.ifca.es -queue imain /bin/hostname
- a lot of problems with LCAS and LCMAPS
- until now the user's DN has to be added explicitly to the grid-mapfile. Otherwise the following error message from LCAS appears: lcas_plugin_voms-plugin_confirm_authorisation_from_x509(): authorization denied based on DN info for user
- The first idea was to modify the existing PBS configuration. PBS was already installed on this CE and only had to be configured with: ./yaim -c -s /root/INSTALL/I2G/yaim/siteinfo/site-info_glite30_I2G_070424_231800_wince01.def -m glite-torque-server-config
- now jobs are added to the queue but stay queued because there is no worker node available. Next, the CE will be configured to work as a Worker Node as well (see the Torque common-issues documentation, documentation/common-issues/torque.php on www.clusterresources.com). In addition you have to add the entry "$clienthost wince01.ifca.es" to $PBS_HOME/mom_priv/config.
Using a dedicated jobmanager to submit windows jobs
Another idea is to use a dedicated job-manager for Windows jobs. A job manager is part of Globus GRAM (Globus Resource Allocation Manager). Please keep in mind that on this CE the Globus version is 2.4 (more info on [9]); this is quite old, but it still belongs to the standard installation of a Computing Element.
Advantages of using a dedicated job-manager would be:
- the Computing Element could also handle usual Linux jobs, and the separation between Windows and Linux jobs would be better, so that jobs do not interfere with each other.
- another problem is still how to get all necessary job parameters from the User Interface. With a dedicated jobmanager for each application, several parameters could be hard coded. This would be an advantage in the beginning, when there are not so many applications running on the cluster.
- for first testing purposes a new job-manager winhostname was created. Basically all the functionality of jobmanager lcgpbs was copied, just to see whether it works in general. Later this new jobmanager can be adapted. The steps necessary for creating jobmanager-winhostname were:
- the first part of the instruction on this website was used
- the second part about packaging was omitted
- basically it started with copying /opt/globus/lib/perl/Globus/GRAM/JobManager/lcgpbs.pm to /opt/globus/lib/perl/Globus/GRAM/JobManager/winhostname.pm and changing the references inside
- the next step was to create a setup script: copy the lcgpbs version to setup-globus-job-manager-winhostname.pl and change the internal references
- copy lcgpbs.rvf to winhostname.rvf (RSL validation file)
- edit /etc/globus.conf and add winhostname to the list of jobmanagers. Also create a new [gatekeeper/winhostname] entry similar to the ones that already exist
- copy /opt/globus/setup/globus/setup-globus-job-manager-lcgpbs to /opt/globus/setup/globus/setup-globus-job-manager-winhostname and change the internal references
- restart globus-gatekeeper
- then, at least in my case, it was possible to submit jobs from a User Interface with the command: globus-job-submit wince01.ifca.es/jobmanager-winhostname -queue imain /bin/hostname
- NOTE: it is possible that not all of these steps are strictly necessary.
- submitting jobs from UI with this command: globus-job-submit wince01.ifca.es/jobmanager-lcgpbs -queue imain /bin/hostname
- later, after passing lcas and lcmaps, globus-job-manager will call lcgpbs.pm.
Job flow in CE
Legend: UI = User Interface, RB = Resource Broker, CE = Computing Element, GRAM = Globus Resource Allocation Manager, CCS = Compute Cluster Server, WN = Worker Node (CN = Compute Node on the Windows side)
Logfiles
The Perl modules and the jobmanager log what exactly happens with the job. The jobmanager writes information to both /var/log/globus-gatekeeper.log and /home/<user>/gram*.log (or a log file named after the server). Apart from that, a special directory was created which only holds information about Windows jobs. The directory is /logfiles/ and the log file is ccs_status.log
How to get Windows Job Results to User / UI
Usually the user calls globus-job-get-output from the UI to get the job results. Apparently this command calls the job manager fork on the Computing Node, which in this case is the Computing Element itself. At the moment this is still a problem, because fork polls for the job and of course does not get a correct job output (this also means that the command never stops, because by default it polls again and again until a correct job response is given). So it seems that fork has to be changed as well. But here fork uses a different job ID than the one assigned when the job was created: it uses e.g. 27034, whereas before an ID like this was used: 1190713815:dyresm:internal_79839633:26591.1190713814.
Ideas:
- at first treat only command line output. This could be retrieved rather easily through the Web Interface and then be written into the corresponding job output folder on the CE. This is the standard location, so from there it should be possible to get the result to the UI when the command above is triggered.
- later, when dealing with bigger output files, they could be compressed into a zip archive and copied to an FTP-accessible folder. The only entry in the output file would then be the link to the FTP folder
Job id map file
At the moment we have to deal with both the Grid job ID and the Windows job ID. This is handled with a mapping file located at /logfiles/jobidfile. Every time a job is sent to the Windows Cluster, its Grid job ID is submitted along and appears in the Job Scheduler on Windows CCS 2003. The job submit function returns the Windows job ID. Both IDs are written to the file above, delimited by the pipe symbol: '|'.
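A minimal Python sketch of such a map file is shown below. The prototype itself uses Perl modules; this version only illustrates the file layout. It assumes one job per line with the Grid job ID first and the Windows job ID second, which the description above does not state explicitly, and the function names are hypothetical.

```python
from pathlib import Path

DELIMITER = "|"


def record_job(mapfile, grid_job_id, windows_job_id):
    """Append one 'grid id|windows id' line to the map file."""
    with open(mapfile, "a", encoding="utf-8") as fh:
        fh.write(f"{grid_job_id}{DELIMITER}{windows_job_id}\n")


def lookup_windows_id(mapfile, grid_job_id):
    """Return the Windows job ID recorded for grid_job_id, or None."""
    path = Path(mapfile)
    if not path.exists():
        return None
    found = None
    for line in path.read_text(encoding="utf-8").splitlines():
        # Split at the last pipe so the Grid ID may contain ':' and '.'
        grid_id, sep, win_id = line.rpartition(DELIMITER)
        if sep and grid_id == grid_job_id:
            found = win_id  # a later entry overrides an earlier one
    return found
```

With the file at /logfiles/jobidfile, `record_job("/logfiles/jobidfile", grid_id, windows_id)` would be called right after a successful submit, and `lookup_windows_id` whenever the status of a Grid job is requested.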
- to be done:
- making it work without USER DN in grid-mapfile
- installing and integrating pbs in this scenario
- sending a first windows job from grid UI to the windows cluster
- sending a complex job from UI to windows cluster
- finding a way to provide grid side with job infos from Windows cluster
Configuring dedicated jobmanager
After creating and registering the new job-manager Perl module winhostname.pm, we can begin with the configuration. Just like other job-managers (lcgpbs, condor) this one has to implement certain functions, e.g. to submit jobs, poll for job status or cancel jobs. For more information about the class jobmanager see the Globus jobmanager reference. winhostname.pm exchanges information with other Perl modules named ccs_submit.pm, ccs_status.pm and ccs_cancel.pm, whose names already explain their function. Module ccs_status.pm has a function that takes the internal job ID from the CE, looks up the corresponding Windows job ID in a text file and then queries the Windows Cluster for the status of this job. The result is returned to the job scheduler. These Perl modules rely on SOAP (install with apt-get install perl-SOAP-lite) to be able to submit data.
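The status lookup described for ccs_status.pm can be outlined as follows. This is a Python sketch of the control flow only, not the Perl module itself: the Windows-side state names are assumptions (the exact strings returned by the web interface are not documented here), the GRAM states are represented by plain strings, and the actual SOAP query is passed in as a function so that the sketch stays self-contained.

```python
# GRAM job states, represented as plain strings for this sketch
GRAM_PENDING, GRAM_ACTIVE, GRAM_DONE, GRAM_FAILED = (
    "PENDING", "ACTIVE", "DONE", "FAILED"
)

# Assumed Windows CCS state names; adapt to what the web interface returns
CCS_TO_GRAM = {
    "Queued": GRAM_PENDING,
    "Running": GRAM_ACTIVE,
    "Finished": GRAM_DONE,
    "Failed": GRAM_FAILED,
    "Cancelled": GRAM_FAILED,
}


def poll(grid_job_id, id_map, query_ccs_status):
    """Translate the Windows job state of a Grid job into a GRAM state.

    id_map maps Grid job IDs to Windows job IDs; query_ccs_status is any
    callable that returns the Windows-side state for a Windows job ID.
    """
    windows_id = id_map.get(grid_job_id)
    if windows_id is None:
        return GRAM_FAILED  # job was never registered in the map file
    ccs_state = query_ccs_status(windows_id)
    # Unknown states count as PENDING so the job is polled again later
    return CCS_TO_GRAM.get(ccs_state, GRAM_PENDING)
```

Mapping unknown states to PENDING rather than DONE is a deliberate choice in this sketch: a job reported as done is not polled again by the job-manager.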
Problems
- while testing, suddenly fork was used instead of lcgpbs, so the scripts from lcgpbs.pm were not executed. --> this seems to coincide with problems restarting the service globus-gatekeeper (edg-gatekeeper dead but subsys locked) --> solution: glite-gatekeeper was running at the same time on the same port --> deactivate it
- suddenly a simple job could no longer be sent to the Windows Cluster when executed from lcgpbs.pm. Cause --> lcgpbs.pm calls the Perl script which invokes a job submit on the Windows cluster, and this script also tries to write its progress to a logfile. When called manually this is no problem, but through lcgpbs.pm another local user is used (in this case imainsgm) which had no rights to write to the logfile. Solution --> put the logfile in the home folder of that user (temporary solution)
- when wince01 is restarted, edg-gatekeeper does not start properly and has to be restarted --> still open
- lately globus-job-get-output on the UI does not return a result. In the gram log you can see that it tries to access the stdout file under the home path of the user, but the file does not exist. globus-job-status says that the job is done (no error message) --> solution: due to the previous issues with fork being taken as job manager instead of the explicitly stated lcg-pbs, a change had been made in /etc/globus.conf: the default manager was set to lcg-pbs. This was wrong; after changing it back to fork the results are displayed again when using globus-job-get-output.
- the submit function in winhostname works, but it seems that the cancel function is never called (when triggered with globus-job-cancel from the UI), and status() does not return a valid result either when requested from the UI.
- The problem of the status function is that it seems to return the 'Done' value right away. The job is therefore considered finished and the job-manager never checks its status again.
- another important point is that the job-managers and the commands called from the User Interface are only partly documented. That is why it is often necessary to do reverse engineering to find out how things work exactly.
What worked and what didn't: as explained on the page mentioned above, there are 3 basic functions of a jobmanager which need to be implemented: submit, cancel and poll. The submit function worked right away but, interestingly, cancel and poll didn't. As far as I understand, these other 2 functions should be called when globus-job-status or globus-job-cancel is invoked from the User Interface. However, these calls do not seem to reach the Computing Element. Observation: just after submit() is called, poll() is called as well. If you return Failed directly in poll, you can see on the UI that the job is still pending. It seems that as long as no result is present, the message shown on the UI is PENDING. Meanwhile the gram log in the folder /home/imainsgm grows. This is explained by the fact that, as long as there is no result, this part of the GRAM function seems to be called over and over again until the job is done.
Next steps
- get ccs_status.pm functions to work
- find a good way to transfer input files to the Windows cluster and to provide results to the user on the User Interface
- most probably this will be done with ftp
- create a job-manager for DYRESM
- create and test an rpm file which includes:
- jobmanager.pm files
- perl SOAP
- perl modules that exchange information with web interface (ccs_status.pm, ccs_submit.pm, ccs_cancel.pm)
- ...
- User Mapping: in a heterogeneous environment with both Linux and Windows users it is possible to set up a user mapping with Services for UNIX (SFU). Unfortunately SFU is not supported on 64-bit systems.
[obsolete] Installation of User Interface (UI) on Virtual Machine
Sending jobs from the Computing Element to the Windows Web interface works. The next step would be to retrieve the correct Input and Output Sandbox information. In order to achieve this a User Interface will be installed. From there it should be possible to send jobs directly to the CE without going through an RB. The UI will be installed as a virtual machine on winhpc04.ifca.es, a Linux machine which also hosts a Windows Server 2003 CN. We have to use Scientific Linux 3.* because using the newer SL4 as host for a UI is not possible at the moment. The idea is to set a flag in the JDL file, tagging the job as a Windows job; in the Computing Element this flag will lead the job to the right cluster (the Windows Cluster).
- Setting up VM to host UI: for configuring the VM please refer to the resources mentioned here:
- general information about XEN
- XEN VMs in int.eu.grid
- populating VM with existing image
- the user interface was successfully installed on a pre-built image which is available here
- final network configuration: set default gateway, DNS-Server and host name, deactivate automatic startup of sendmail
- Installation and configuration of User Interface
- Instruction used here: UI installation
- Installation of yaim glite-yaim-3.0.1-15 (the same as on wince01.ifca.es)
- deactivate apt-autoupdate (modify /etc/sysconfig/apt-autoupdate)
- Installation of the Java SDK
- with help from Rafael Marco de Lucas the UI was installed by using yaim commands: /opt/glite/yaim/bin/yaim -i -s site-info.def -m glite-UI and /opt/glite/yaim/bin/yaim -c -s /root/INSTALL/I2G/yaim/siteinfo/site-info.def -n UI
- the next step will be to try to submit a job from the UI to the CE --> this was not as easy as expected because there were problems with proxy and certificates. Instead it was possible to use another UI, i2gui01.ifca.es, to at least submit jobs to wince01.ifca.es.
Conclusion Prototype
This early prototype still lacks important features, e.g. error handling, security, user authentication and user mapping between Linux and Windows. But it shows that it is possible to provide a single point of entry for users who want to use both the Windows and the Linux cluster. However, modifying a complete job manager is not that easy, although some parts are documented. The documentation is not aimed at people who want to do such unusual things as connecting the grid cluster to a Windows Cluster. Also, some parts are not well documented, so you need to do reverse engineering to find out what really happens. At the moment, finding suitable applications and scenarios may be one of the main challenges for this prototype. But nobody knows which way the Grid and Windows CCS will go, so it may be that in future these clusters complement each other more. In that case this prototype can offer valuable clues and serve as a basis.
It might also be interesting to try an integration with Globus Toolkit 4. It uses Web Services, which might make it easier to connect to the Windows Cluster. Finally, it has to be admitted that, although the idea sounds nice, it would be very important to have some users who already use the Windows Cluster.
Conclusion
Windows CCS 2003 - pros and cons:
pros
- easier and faster setup compared to the grid, because the software is less complex and consists of fewer components
- (easy integration into an existing Windows environment, but at IFCA there is no centrally managed Windows environment)
- all-in-one solution: OS, installation and administration tools, and middleware come in one package
cons
- until now no integrated mechanism to replace the services of the head node in case of a failure
- lack of MPI applications (speaking about Windows CCS 2003 in general), and use of MPICH2, which at the moment is not used by many applications.
- lack of applications and users to justify the cluster's existence (speaking about the cluster at IFCA)
Outlook
- An important difference to the Grid is this: a Windows Cluster is usually distributed over a small area. Each cluster has one Head Node to which your clients connect and submit jobs. The components of a Grid cluster are usually more widely spread. For example, you submit a job from one site and it may be executed on the other side of the world. At the moment Windows CCS does not provide this feature. It therefore depends on your needs which cluster platform you want to use.
- the trial version installed on the nodes will expire in October. Either the full version has to be bought or the servers have to be set up again. To convert a trial version to a full retail version, just boot from the non-trial CD and select the upgrade option in text mode.
- Also, it would be very helpful to have an application which is actually run by users, in order to get feedback and to be able to improve the prototype.
Bibliography/External Links
Attachments
- Virtual Windows Compute Cluster node with XEN on Scientific Linux: Media:Docu-XenwithWinCCS.pdf

