David Front Weizmann institute May 2007


1 Stress testing Athena reconstruction: remote COOL reading by up to 100 clients
David Front, Weizmann Institute, May 2007

2 What was tested?
The following is a summary of stress tests of remote COOL reading by Athena reconstruction jobs, with up to 100 clients running at the same time. The Athena clients run at a tier 2 site, the Weizmann Institute (WI), and fetch data from CERN. The intent is to simulate the use case of a tier 2 (or tier 3) site that has no local Oracle server and runs many Athena reconstruction test clients, all for the same run and all reading the same data.
Three means of reading COOL data for Athena reconstruction are compared:
1) 'WI squid': a squid server running at the Weizmann Institute, using the frontier server at CERN.
2) 'squid': a squid (and frontier) server at CERN.
3) 'oracle': reading directly from the Oracle server at CERN.
All three means end up using the cooldev Oracle server at CERN.

3 Stress testing of Tier 1 replicas with ATHENA?
The tests described in this presentation are not the 'Tier1 DB stress tests with multiple clients reading concurrently' that are planned to be done at Lyon. Since Lyon is a tier 1 (whereas my tests run at a tier 2), the Oracle server there is local, and hence each reconstruction job needs less time to read COOL data (~1/10).
People involved with the testing at Lyon: Richard Hawkings, Stefan Stonjek (MPI) and Ghita Rahal (Lyon).
My 'VerificationClient' scripts may be used to spawn and monitor multiple client processes from multiple client hosts.

4 Testing setup
[Diagram: the Weizmann Institute (WI) site, behind a firewall and a NAT (350 Mbit), connected to CERN, which is also behind a firewall. Ping time ~56 ms.]
- Client machine at WI, running the Athena clients: eio49.weizmann.ac.il (1 Gbit eth.)
- Squid server at WI: hepsquid1.weizmann.ac.il (100 Mbit eth.)
- Squid server and frontier server at CERN: Frontier3d2.cern.ch (lxb5555)
- Cooldev Oracle server at CERN: atlascool2.cern.ch
- Paths shown: 1) 'WI squid', 2) 'squid', 3) 'oracle'
- T marks links where an ssh tunnel was used; on the 'oracle' path the ssh tunnel was used only for comparison.

5 Short summary
Even though the testing environment has considerable limitations, it seems that:
1) Reading from a squid server at the client side is fastest, but consumes the most client resources.
2) Reading from a squid server at CERN appeared to be too slow, probably because of a limitation of the testing environment (reading via an ssh tunnel to work around a security restriction).
3) Reading from an Oracle server at CERN is considerably slower than 'squid server at client side' (~800 seconds), but not too slow. It scales well, even though the Oracle server is not very strong.
Hence, for this particular use case, it seems acceptable to read either from a far-away Oracle server or from a squid server at the client side.
Having a squid server at the client side is 'nice to have', in the sense that it speeds up client reading (and, TBD, offloads the Oracle server), but these test results do not indicate that it is strictly required.

6 Graphs: compare 'WI squid', 'squid' and 'oracle'

7 Explaining the graphs
The X axis is the target number of clients running at the same time. For each such client process, 5 Athena clients are spawned, one after the other (to ensure that the X value is sustained). Each graph shows the average value over all clients.
The graphs:
- 'elapsed': time to run an Athena client (one that mainly gets COOL data for reconstruction only), in seconds.
- 'MBWor': megabytes each client should read (hard coded from an estimate rather than measured).
- 'MBpMinWor': megabytes read per minute.
- 'load5': load5 of the client machine.
- 'passed': number of clients that passed (managed to read all data). Since each client process should run 5 Athena clients, the target for this number is 5 times the number of clients that should run in parallel ('maxClientsRunning').
- 'CombinedMbPerSec': the combined megabytes of data read per second (calculated, assuming each client reads 10 MB; actual network traffic may differ from this measure).
- 'numClientsRunning': the actual measured number of clients running at the same time (measured by each client after fetching its data).
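The throughput metrics above are derived rather than measured, so a small sketch may help show how such numbers relate to the elapsed time. The formulas and function names below are assumptions for illustration (the slides do not give the exact calculation); only the 10 MB per client figure comes from the slide.

```python
# Assumption taken from the slide: each client is taken to read 10 MB.
ASSUMED_MB_PER_CLIENT = 10.0


def mb_per_minute(elapsed_s, mb=ASSUMED_MB_PER_CLIENT):
    """Per-client read rate in MB/minute (hypothetical formula for 'MBpMinWor')."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return mb * 60.0 / elapsed_s


def combined_mb_per_sec(num_clients_running, elapsed_s, mb=ASSUMED_MB_PER_CLIENT):
    """Aggregate read rate in MB/s (hypothetical formula for 'CombinedMbPerSec')."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_clients_running * mb / elapsed_s


if __name__ == "__main__":
    # Elapsed times quoted later in the slides: 30 s for one 'WI squid' client,
    # 830 s for 100 'oracle' clients.
    print(mb_per_minute(30))              # per-client rate, one 'WI squid' client
    print(combined_mb_per_sec(100, 830))  # aggregate rate, 100 'oracle' clients
```

Because the 10 MB figure is an estimate, these values say more about relative behaviour between the three reading means than about real network traffic.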

8 HW/SW
HW (32-bit machines)
Server machines:
- hepsquid1.weizmann.ac.il: 1 CPU (Pentium 4), 2.4 GHz, 0.5 GB memory
- frontier3d2.cern.ch: 2 CPUs, 2.8 GHz, 2 GB memory
- atlascool2.cern.ch: 2 CPUs, 1 GHz, 1 GB memory
Client machine: all Athena clients run from one strong client machine, a dual dual-core Woodcrest with 4 x 2.60 GHz CPUs and 5.87 GB memory.
SW
- atlascool2 runs the cooldev Oracle DB as a non-RAC server
- athena
- COOL: COOL_1_3_3
- CORAL: CORAL_1_5_3
- Athena client: Richard Hawkings' client
- Spawning scripts: VerificationClient

9 Testing limitations
Generally, the following limitations make the results look less promising than I would expect without them.
1) Only one client machine is used, for the following reasons:
- The Weizmann Institute farm is behind a NAT (VPN), and AFS behind a NAT does not work well for more than one client machine.
- My stand-alone installation of the SW for these tests did not work well (with the squid CORAL plugin).
- Hence, the SW is taken from AFS, but only one client machine runs.
2) Tunneling (being fixed, thanks to Dirk Duellmann): CERN has lately tightened its security. As a result, I could not send squid/frontier requests directly from WI to lxb5555, and used an ssh tunnel instead.
- For the squid server at WI, tunneling was used between it and the frontier server at CERN, but since data is cached at WI, response time is not harmed by tunneling.
- Reading from the squid server at CERN is slower because of tunneling.
- When using Oracle, no tunneling was needed.
3) The old version of the CORAL frontier plugin issues a lot of 'select 1 from dual' queries.

10 Scaling up to 100 clients with one client machine
- 'WI squid': only up to about 40 clients are handled. The elapsed time is low: 30 seconds per client for one client, 110 seconds for 40 clients. (With multiple clients, the load goes up. When the load is too high (45), further clients are not spawned until the load goes down. Test to be repeated with more client machines.)
- 'squid': 100 clients did run, but elapsed time grew almost linearly with the number of clients, up to an unacceptable elapsed time of about one hour. However, this may be due to using the ssh tunnel. Test to be repeated without the tunnel. This use case makes less sense than the previous one ('WI squid'), in the sense that it does not exploit the advantage of having a squid server near the client.
- 'oracle': the tests scaled quite well up to 100 clients (except for some client failures at 100 clients, as seen in the 'passed' graph). Elapsed time: 750 seconds for one client, 830 seconds for 100 clients. This is considerably longer than the elapsed time of 'WI squid', but is acceptable (according to Richard Hawkings).
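The load-based throttling mentioned for 'WI squid' (no new clients while the load is above 45) can be sketched as a simple gate. This is a minimal illustration, not the actual VerificationClient logic, which the slides do not show; the function names are hypothetical, and reading 'load5' as the 5-minute load average is an assumption.

```python
import os

# Load threshold quoted on the slide; everything else here is illustrative.
LOAD_LIMIT = 45.0


def current_load5():
    """5-minute load average of this machine (Unix only).

    Assumption: 'load5' in the graphs means the 5-minute load average.
    """
    return os.getloadavg()[1]


def may_spawn(num_running, max_clients_running, load, load_limit=LOAD_LIMIT):
    """Return True if one more client may be started.

    A client is spawned only while fewer than the target number are running
    and the machine load is below the limit.
    """
    return num_running < max_clients_running and load < load_limit


if __name__ == "__main__":
    # Example decisions for a target of 40 concurrent clients.
    print(may_spawn(10, 40, load=12.0))  # below target, load fine
    print(may_spawn(10, 40, load=50.0))  # load too high, hold back
    print(may_spawn(40, 40, load=1.0))   # target already reached
```

A gate like this would explain why the measured 'numClientsRunning' can stay below the X-axis target on a single client machine.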

11 Further testing
- Work through the limitations, use multiple client hosts, scale the testing up and saturate the Oracle server (and squid server).
- Test with Atlas 13.X rather than the release used here, for better performance.
- Following Andrea Valassi's methodological suggestion, the main parameter in future testing (the X axis parameter of the graphs) will be 'spawn a client every X seconds' rather than the current parameter, 'run Y clients concurrently'.
Advantage: it is easier to compare the resulting performance between different 'reading means' (Oracle/squid ... remote/local) and different metrics.
Computing X: for a given average elapsed time E, X = E/Y.
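As a worked example of the relation X = E/Y, the sketch below converts the "Y concurrent clients" figures quoted in these slides into an equivalent spawn interval. The function name is hypothetical; the inputs are the elapsed times reported for 'oracle' (830 s at 100 clients) and 'WI squid' (110 s at 40 clients).

```python
def spawn_interval(avg_elapsed_s, concurrent_clients):
    """Spawn interval X (seconds) that sustains Y concurrent clients.

    Implements X = E / Y: if each client takes E seconds on average and a new
    one starts every X seconds, about E / X clients run at any time.
    """
    if concurrent_clients <= 0:
        raise ValueError("concurrent_clients must be positive")
    return avg_elapsed_s / concurrent_clients


if __name__ == "__main__":
    print(spawn_interval(830, 100))  # 'oracle' at 100 clients
    print(spawn_interval(110, 40))   # 'WI squid' at 40 clients
```

With these inputs the intervals come out at about 8.3 s for 'oracle' and 2.75 s for 'WI squid', which puts the two reading means on a common arrival-rate scale.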

12 Links In addition to the above graphs see result details at rows: First graphs: 'athenaReadWi_ _ ‘ Tunnel graphs: 'athenaReadWi_ _ '

13 Simulating the dependence between network latency and elapsed time
The following results, by Richard Hansen, demonstrate that the time needed to read COOL data (for reconstruction) strongly depends on the network latency.
[Table: sleep time versus time to run the job, in seconds. Surviving entries: 'About 3 and 1/2 hours' and '75.27'.]
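One way to picture a strong latency dependence is a linear model in which every database round trip costs one network round-trip time. The sketch below is only a back-of-envelope illustration under loud assumptions: the linear model itself, the use of the single-client 'WI squid' time (30 s) as the latency-free baseline, and the ~56 ms ping time as the per-round-trip cost are all assumptions, not results from the slides.

```python
def estimate_round_trips(elapsed_remote_s, elapsed_local_s, rtt_s):
    """Round trips implied if all extra remote time were pure latency.

    Assumed model: elapsed_remote = elapsed_local + round_trips * rtt.
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    return (elapsed_remote_s - elapsed_local_s) / rtt_s


def predict_elapsed(base_s, round_trips, rtt_s):
    """Elapsed time predicted by the same assumed linear model."""
    return base_s + round_trips * rtt_s


if __name__ == "__main__":
    # Figures quoted in the slides: 750 s ('oracle', one client),
    # 30 s ('WI squid', one client), ping time ~56 ms.
    n = estimate_round_trips(750, 30, 0.056)
    print(round(n))                       # implied number of round trips
    print(predict_elapsed(30, n, 0.112))  # same job if latency doubled
```

Under this model the job time grows linearly with latency, so doubling the round-trip time roughly doubles the latency-bound part of the elapsed time.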

14 Spare slides

15 Estimating the effect of tunneling
In order to estimate the slow-down caused by tunneling, I repeated a test with tunneled Oracle. The result, as depicted in the following graphs, suggests that Oracle with tunneling performs at the same order of magnitude of elapsed time as (tunneled) squid. One may guess that, without tunneling, 'squid' will perform with an elapsed time similar to that of 'oracle'. (I did get similar partial results in the past, but this should be fully tested.)

16 Comparing ‘oracle’ with ‘squid’ both via ssh tunnels

