1
[Diagram: Site concept, showing the frontend/backend split, gridftp, the batch system, grid users and local users]
2
StoRM + GPFS allows the build-up of an integrated and scalable system. Each external certified user can access the information on the site through StoRM, which provides a secure portal; they can also submit to the local resources. Local users can access the storage as if it were locally mounted, avoiding network protocols, overheads and system load. A list of recommended settings follows.
3
[Diagram: Network layering, with storage headnodes on the base layer, application servers on the core layer and public servers on the public layer]
4
The Milan Tier2 is spread across two rooms, so that the two pools cooperate fully during normal operations and provide a failover in case of disaster recovery. The network is layered.
Base level: a cross LAN connecting all the GPFS servers; each machine involved has an interface on it.
Core level: each room has its own LAN, and the two are unrelated.
Public level: the externally visible devices. All public services are balanced (gridftp, UI, frontend) and answer to a virtual IP.
5
[Diagram: Clustering, with cluster masters #1 and #2 exporting filesystems to the UI, the batch pool and StoRM]
6
GPFS
Several slave clusters have been created, to steer their attributes separately and to control the visibility of the filesystems in an organized way.
Points of failure: each cluster has two configurator machines, which must be physically separated. The quorum must always be built with an odd number of machines, all unrelated.
Stable, secure and reliable: GPFS tolerates the failure of a machine, or on-the-fly maintenance on the disks, without visible effects.
The pagepool must be correctly dimensioned: set it safely to 1.2 GB on i686 (the maximum value allowed by the architecture) and to 2 GB on x86_64. An insufficient value causes the mmfsd daemon to get stuck on the machine, and the same process lock later affects all the remaining machines in the cluster.
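A hedged sketch of the pagepool tuning (the node name is hypothetical): the value can be set per node with mmchconfig and verified with mmlsconfig.
mmchconfig pagepool=2G -N ts-b1-1.mi.infn.it
mmlsconfig pagepool
# the new value takes effect when the mmfsd daemon restarts on that node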
7
Some GPFS versions are compatible with each other, but using the same version everywhere is recommended. Systems can be upgraded on the fly, one by one, without any consequences. If the maintenance involves the configurator machines, first shift this role onto other machines in the cluster. Synchronization is mandatory: use the same NTP servers everywhere. Using jumbo frames on the base level dramatically increases performance. For a good optimization, avoiding useless or redundant traffic and keeping the system more reliable, the clustering must be applied together with a correct access management of the filesystems. In the picture: the slave cluster UI could import filesystems from master #2, but not from #1, or just a few filesystems from #2.
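A minimal sketch of the jumbo-frame and NTP checks on one node (interface and host names are hypothetical):
ip link set dev eth1 mtu 9000   # enable jumbo frames on the cross-LAN NIC
ping -M do -s 8972 ts-b1-2      # verify the 9000-byte path (8972 bytes payload + 28 of headers)
ntpq -p                         # confirm the node follows the common NTP servers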
8
Software Area
The software area is used intensively, with continuous, dense accesses to the disks and many small files. A huge cache is needed on the GPFS servers responsible for the data distribution. A pure GPFS structure is inefficient here, due to caching issues; better performance is achieved through NFS. The Clustered NFS (CNFS) provided by the GPFS system is dangerous, because a denial of service on a pool machine causes network slowness on the slave cluster serving NFS, and reliability is compromised.
9
A GPFS denial of service on one of these machines causes the same effect. When the service slows down under intensive use of the software area, the system suffers in the same way. Distributing over more failure groups gives no further advantage in case of disaster recovery; the slave clusters report almost the same issues. Physical co-location of the CNFS disks worsens the situation. CNFS must be realized on a separate, properly customized cluster; this is worthwhile for a huge number of clients (>100 units), with dedicated hardware (storage, headnodes, disk space). Otherwise a better choice is a classic NFS export: the concept of an integrated system still remains, while keeping the GPFS benefits.
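A minimal sketch of the classic NFS export alternative (the path and the network are hypothetical): the software area lives on GPFS on one server and is exported with plain nfsd.
# /etc/exports on the exporting node
/gpfs/software  192.168.127.0/24(ro,async,no_root_squash)
# reload the export table
exportfs -ra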
10
[Diagram: Site architecture. Data clusters #1 and #2 export their GPFS filesystems (some read-only) to the Tier3 pool, the UI, the Condor pool, the PBS pool, Proof, StoRM and the private home areas]
11
[Diagram: detail of the Proof pool, mounting the GPFS filesystems of data clusters #1 and #2 read-only]
12
[Diagram: detail of the Tier3 pool, mounting the GPFS filesystems of data clusters #1 and #2]
13
This is the global site architecture created for Milan. The clustering has been implemented to take the different scopes into account; the interconnections show the filesystem mounts directed to both data clusters. Proof is a mini-architecture built with a few machines inserted in a balanced pool: users can log in locally and use the Proof nodes as a single cloud. The whole system is attached to the GPFS facilities, so that the machines can see the common data filesystems. Tier3 is another mini-architecture, implemented through a user interface and a few machines used as worker nodes, executing reserved jobs. The computing element can be installed on the UI as well, or on a further machine configured for this purpose. A complete system overview is visible below.
14
Current Overall System Network
15
GPFS clustering
16
GPFS rebirth, a use-case
Let us suppose we have to create a cluster from scratch. You probably have:
–some disk servers, with their storage systems
–some machines elected to be clients, reading from and writing to the storage
Starting is quite easy. As a system administrator you simply configure the LUNs and their mapping on the disk servers; in any case it is better to map them all on all the servers, for failover purposes. Then, after installing GPFS on all the machines, proceed with the following network settings, as an optimization of the whole system (a sketch of the SSH setup follows below):
–create a common SSH key for root and distribute it on all the cluster machines; modify /etc/ssh/ssh_config, switching off the mandatory host-key confirmation (StrictHostKeyChecking)
–disable the SSH banner, if present
–synchronize ALL the machines using NTP
–set up jumbo frames on all the involved NICs (and switch ports)
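A minimal sketch of the common-root-key setup (the hostnames are hypothetical):
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa            # one common keypair
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
for h in ts-b1-2 ts-b1-3 ts-b1-4; do                    # push it to every other node
  scp -r /root/.ssh "root@$h:/root/"
done
# then, in /etc/ssh/ssh_config on every node:
#   StrictHostKeyChecking no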
17
GPFS rebirth, a use-case
Configure the network interfaces with bonding wherever a huge dataflow is foreseen. Establish a common LAN for the cluster: it is not mandatory, but it gives good performance. Different clusters need not share the same LAN. Perform this setting on all the machines:
cat >> /etc/sysctl.conf <<set
# GPFS Settings
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
net.core.netdev_max_backlog = 2500
set
Reboot, to ensure the settings are operative.
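A hedged sketch of the bonding setup on a RHEL-like node (device names, bonding mode and addresses are assumptions):
cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<'EOF'
DEVICE=bond0
IPADDR=192.168.127.11
NETMASK=255.255.255.0
ONBOOT=yes
MTU=9000
BONDING_OPTS="mode=802.3ad miimon=100"
EOF
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
EOF
# repeat for the other slave NIC(s), then restart the network service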
18
GPFS rebirth, a use-case
Cluster creation:
mmcrcluster -n nodelist.txt -R /usr/bin/scp -r /usr/bin/ssh -p ts-b1-1.mi.infn.it -s ts-b1-2.mi.infn.it -A -C clustername.mi.infn.it -U mi.infn.it
where the node list passed with -n has the format nodename:designation (an example follows below). You have to choose two nodes as primary (-p) and secondary (-s) cluster configuration managers. During maintenance on these nodes, their roles must be transferred to other machines:
mmchcluster -p newnode (or -s newnode)
This special role change always works, even if the previous configuration manager has been switched off or has stopped critically, since it is vital for the other nodes to survive. The roles of the other nodes can also be rotated on the fly:
mmchnode --quorum --manager -N node1
mmchnode --nonquorum --manager -N node2
mmchnode --quorum --manager -N node3
mmchnode --client -N node4
Further machines can be added to the cluster later, or even removed:
mmaddnode -N node
mmdelnode -N node
Before launching the latter command, you should check whether that node has administrative roles and switch them to another node first.
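A hypothetical nodelist.txt matching this format (hostnames and roles are illustrative; a node with no designation defaults to a non-quorum client):
ts-b1-1.mi.infn.it:quorum-manager
ts-b1-2.mi.infn.it:quorum-manager
ts-b1-3.mi.infn.it:quorum
ts-b1-4.mi.infn.it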
19
GPFS rebirth, NSD
Network Shared Disk (NSD) creation:
mmcrnsd -F nsd-list.txt
where the text file has the format DiskName:ServerList::DiskUsage:FailureGroup:DesiredName, for instance:
sdb:ts-b1-2,ts-b1-1,ts-b1-4,ts-b1-3::dataAndMetadata:1:c2
sdc:ts-b1-3,ts-b1-4,ts-b1-1,ts-b1-2::dataOnly:1:c3
sdd:ts-b1-4,ts-b1-3,ts-b1-2,ts-b1-1::dataOnly:1:c4
sde:ts-b1-1,ts-b1-2,ts-b1-3,ts-b1-4::dataOnly:1:c5
All physical devices must be visible with the same name on all the disk-server nodes exporting them.
Statistical considerations put the metadata space at 1% of the total; hence dedicate, if feasible, a smaller unit to metadataOnly usage. This pushes up the performance.
Failure groups drive the replication factors of a filesystem. GPFS can act, if explicitly chosen, as a software RAID 1 (mirroring): the fs daemon distributes the replicas on different failure groups. If just one group is specified, GPFS won't replicate at all.
The server list helps GPFS understand where to reach the disk, with a failover policy: if the first server doesn't work, it shifts to the second one. In a SAN with FC switch connections, the user can omit this list; the fs daemon will choose the first disk server of the cluster that can somehow reach the disk.
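After the creation, the NSDs can be checked (a hedged sketch using the standard listing commands):
mmlsnsd       # lists the NSDs and the filesystem each one belongs to ("(free disk)" if none yet)
mmlsnsd -m    # maps each NSD name to its local device on the server nodes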
20
GPFS rebirth, new filesystem
mmcrfs /dev/storage_2 -F nsd-list.txt -B 256K -m 1 -M 2 -r 1 -R 2 -Q yes -n 512 -A yes
The -m -M -r -R options state the replication factors. They will work if:
* you set them up >1
* more than one failure group is present
Replication is difficult to cancel (it can be done with a restripe only), so plan for it accurately from the very start.
Mounting: mmmount storage_2 -a
Unmounting: mmumount storage_2
The -A option states that the fs is automatically mounted: in case of failures, the GPFS nodes will recover the mounted areas by themselves, without interventions. Use these commands directly only if needed.
The NSD list has been modified by GPFS during NSD creation into this format:
c5:::dataOnly:1::
If a storage pool was specified in the NSD declaration, then the filesystem will be created with this feature as well. A storage pool is a group of disks the user can select to take into account hardware considerations, disk-striping issues, different performances and so on. It is created inside the filesystem, as if it were a physical partition of it.
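To verify where the new filesystem is actually mounted, a quick hedged check:
mmlsmount storage_2 -L    # lists every node currently mounting storage_2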
21
GPFS rebirth, new filesystem
The "system" storage pool is the default: whenever others are present, this one always contains the metadata information and the file descriptors.
To check the filesystem properties, issue the command:
mmlsfs storage_2
Deleting a disk can be done on the fly:
mmdeldisk storage_2 "diskname1;diskname2;....." -N node1,node2,...,nodeN
This operation automatically involves a restripe; use the -N option to select the nodes where to run it.
It's always possible to add NSD disks on the fly to the current configuration:
mmadddisk storage_2 -F new-nsd-list.txt
This command has further options to additionally make GPFS restripe the filesystem content. This is an intensive operation: please read the manual carefully, and use a sensible choice of nodes to run it on with the -N option.
Explicit restripe:
mmrestripefs storage_4 -b -N ts-a1-1.mi.infn.it,ts-a1-6.mi.infn.it,ts-a1-8.mi.infn.it,ts-a1-10.mi.infn.it
This command rebalances the filesystem storage_4, using the listed nodes to accomplish the operation.
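A hedged way to verify the balance after adding disks or restriping is to check the per-NSD occupancy:
mmdf storage_2    # free and used space per NSD and per storage pool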
22
GPFS rebirth, fileset
A fileset is an attribute of files, identifying a subgroup of them with common properties or a common target. In the grid context, the fileset is used to cope with the ATLAS space-token definitions.
Creating:
–mmcrfileset /dev/storage_2 atlasdatadisk
–mmcrfileset /dev/storage_2 atlasmcdisk
–...
Linking to a path:
–mmlinkfileset /dev/storage_2 atlasdatadisk -J /gpfs/storage_2/atlas/atlasdatadisk
This path is created automatically on the filesystem by GPFS: do not mkdir it. It will appear as a normal dir, but it's a sort of link.
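To review the filesets and their junction paths, a hedged check:
mmlsfileset storage_2 -L    # fileset names, ids and the paths they are linked to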
23
GPFS rebirth, quota management
Switch on the quota system:
mmquotaon storage_2
Quotas can be set on single users, groups or filesets, or be a global default.
To edit a quota, issue such a command (fileset based):
mmedquota -j storage_5:atlaslocalgroupdisk
On a vi-like page you'll set the proper values for the soft and hard limits; save, and the quota will be active. Quotas can also be adjusted later.
To switch on a global default quota system, for instance on users:
mmdefquotaon -u storage_3
To set it up:
mmdefedquota -u storage_3
To check the quota usage on a filesystem:
mmrepquota -j storage_2
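A hedged sketch of inspecting a single quota without the full report (the fileset name comes from the example above):
mmlsquota -j atlaslocalgroupdisk storage_5    # current usage against the soft and hard limits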
24
GPFS rebirth, remote cluster
Clustering is the most scalable operation available on a GPFS architecture. Just create the slave cluster issuing the usual mmcrcluster command.
First create the security keys on both clusters:
mmauth genkey new
mmauth update . -l AUTHONLY
Copy them respectively onto a manager of the partner cluster; the file lies in /var/mmfs/ssl/id_rsa.pub.
On the master issue:
mmauth add clusterN -k keyfile
mmauth grant clusterN -f /dev/storage_2 -a rw (or ro, depending on the authorization level)
On the slave cluster:
mmremotecluster add clusterMasterName -k keyfile -n contactnodes
mmremotefs add /dev/storage_2 -f /dev/storage_2 -C clusterMasterName -A yes
The -A option makes the filesystem mount automatically on the slave members when GPFS is started.
25
GPFS rebirth, remote cluster
mmauth show, issued on the master (from any node), shows the list of all the remote clusters and the privileges they have.
mmremotecluster show, issued on the slave, shows the list of all the masters it is attached to.
mmremotefs show, issued on the slave, shows all the filesystems it has inherited from the masters.
Both the latter commands have an "update" function, enabling the user to modify an imported filesystem's authorizations or naming.
26
GPFS rebirth, monitoring
Waiter threads are the first indicator of something wrong:
mmlsnode -N waiters -L
This checks, on all the member nodes, the threads possibly put in wait for queued events, listing them all with the related sleep time. A value below 1 second is a good performance indicator, even if there's a long queue of waiters. The same output, but for a single machine, can be obtained with:
mmfsadm dump waiters
Performance monitor:
mmpmon -i command-file -p -d 1000 -r 10 | awk -f parsing-file
where command-file is, for example:
once nlist new ts-b1-1 ts-b1-2 ts-b1-3 ts-b1-4 ts-b1-5 ts-b1-6 ts-b1-7 ts-b1-8
fs_io_s
A list of monitored members is built, then a filesystem-based I/O analysis is run. The parsing-file is any awk (or bash, or perl) program able to parse the output. Without the option -p (programmatic parsing) mmpmon gives:
27
GPFS rebirth, monitoring
By taking differences between subsequent samples, the user can get a complete bandwidth analysis of the performance on all the exported filesystems.
mmpmon node 192.168.127.108 name ts-a1-11 fs_io_s OK
cluster: infn-mi-cluster.mi.infn.it
filesystem: storage_5
disks: 16
timestamp: 1271693316/868164
bytes read: 1844982491727
bytes written: 0
opens: 22155
closes: 22152
reads: 7050518
writes: 0
readdir: 433
inode updates: 72159
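A hypothetical parsing-file for the programmatic (-p) output (the underscore-delimited keywords _br_ and _bw_, for bytes read and written, are an assumption about mmpmon's -p record format):
# sum the bytes read/written over all the sampled fs_io_s records
/_fs_io_s_/ {
  for (i = 1; i <= NF; i++) {
    if ($i == "_br_") br += $(i + 1)
    if ($i == "_bw_") bw += $(i + 1)
  }
}
END { print "bytes read:", br, "bytes written:", bw }
Run as: mmpmon -i command-file -p -d 1000 -r 10 | awk -f parsing-file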