Distributed Protein Structure Analysis
Jeremy S. Brown, Travis E. Brown
The Problem
- An exhaustive search of proteins against a known database
- Each string is between 400 and 600 characters long
- Comparing 10,000 strings against 1,000,000 random strings would take 44 days on a 1 GHz processor
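The 44-day figure is plausible as a back-of-envelope estimate. The sketch below assumes each comparison is a dynamic-programming alignment of two ~500-character strings at roughly one DP cell per cycle; that cost model is an assumption, not a figure from the slides.

```python
# Back-of-envelope cost of the exhaustive search.
# Assumption: each pairwise comparison fills a ~500 x 500 DP table,
# at roughly one cell per clock cycle on a 1 GHz processor.
queries = 10_000
database = 1_000_000
avg_len = 500
cycles_per_cmp = avg_len * avg_len            # ~250k cycles per pair (assumed)
total_cycles = queries * database * cycles_per_cmp
seconds = total_cycles / 1e9                  # 1 GHz = 1e9 cycles/second
days = seconds / 86_400
print(f"{days:.0f} days")                     # prints "29 days"
```

The estimate lands in the same order of magnitude as the 44 days quoted on the slide; the gap is easily explained by per-comparison overhead beyond the raw DP table.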
Why an exhaustive search?
- Initial intent was to analyze proteins to determine 3-dimensional structure
- An exhaustive search is required to ensure that the match found is the best match
Solution
- Distribute the search among many PCs to obtain an answer faster
- This solution raises more problems, however
How to Distribute
- Distributing the search strings is not enough; the search space must also be distributed
- Must find an efficient way to distribute 1 GB of data without duplication
Program Details
- Client/server architecture
- Uses a proprietary protocol over TCP/IP to distribute data
- Server uses an SQL database to store a list of 'jobs' and 'known' sequences
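The protocol itself is proprietary, so the sketch below is only a hypothetical stand-in showing one common way to frame messages over a TCP byte stream: a 4-byte length prefix followed by the payload. None of the message names come from the slides.

```python
import struct

# Hypothetical length-prefixed framing over TCP (the real protocol is
# proprietary; "REQUEST_JOB" is an invented message name).
def frame(payload: bytes) -> bytes:
    # Network-byte-order unsigned 32-bit length, then the payload.
    return struct.pack("!I", len(payload)) + payload

def unframe(buffer: bytes) -> bytes:
    # Read the length prefix, then slice out exactly that many bytes.
    (length,) = struct.unpack("!I", buffer[:4])
    return buffer[4:4 + length]

message = frame(b"REQUEST_JOB")
assert unframe(message) == b"REQUEST_JOB"
```

Length-prefixed framing matters on TCP because the protocol delivers a byte stream, not discrete messages; without a prefix the receiver cannot tell where one request ends and the next begins.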
Program Details
- Server issues a single 'job' to each client upon request
- A client may also request a batch of data for comparison
- Server marks which data has been sent to clients and avoids resending it to new clients
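The slides describe the server's bookkeeping at a high level; a minimal in-memory sketch of that logic might look like the following. The real server keeps this state in an SQL database, and all names here are hypothetical.

```python
# Minimal in-memory sketch of the server-side bookkeeping described on the
# slides. The actual implementation stores jobs and sent-data flags in SQL;
# class and method names here are invented for illustration.
class JobServer:
    def __init__(self, jobs, data_chunks):
        self.pending_jobs = list(jobs)
        self.unsent_chunks = list(data_chunks)

    def request_job(self):
        # Issue a single job per client request.
        return self.pending_jobs.pop(0) if self.pending_jobs else None

    def request_data(self, batch_size):
        # Hand out a batch of search data and mark it as sent, so the
        # same chunk is never transmitted to a client twice.
        batch = self.unsent_chunks[:batch_size]
        del self.unsent_chunks[:batch_size]
        return batch

server = JobServer(jobs=["job-1", "job-2"], data_chunks=list(range(10)))
```

Marking chunks as sent is what keeps the 1 GB search space from being distributed with duplication: each client receives a disjoint slice of the remaining data.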
The Server
The Client
Problems
- Server is slow at updating its database
- This cost is only incurred once per client, however
Performance Analysis
- 1 client = 44 days (best algorithm)
- 2 clients = 26 days
- 3 clients = 19 days
- Adding clients decreases the total time almost linearly, though distribution is expensive
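The speedup and parallel efficiency implied by the slide's numbers can be computed directly (speedup = T1/Tn, efficiency = speedup/n):

```python
# Speedup and parallel efficiency from the timings reported on the slide.
baseline = 44.0                       # days with 1 client
times = {1: 44.0, 2: 26.0, 3: 19.0}   # clients -> days

for n, t in times.items():
    speedup = baseline / t
    efficiency = speedup / n
    print(f"{n} clients: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

This yields roughly 85% efficiency at 2 clients and 77% at 3, consistent with "almost linear" scaling eroded by distribution cost.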
Graphs
Notes about the graphs
- Graphs do not include the initial distribution, since it is done only once per client
- If search-data distribution were included, efficiency would start at about 70% and rise to ~90% over time
Verifying Data Accuracy
- Add an entry to the search table with a known score
- Ensure that the result returned by the client is the known entry in the database
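The sentinel check above can be sketched as follows. The sequence, score, and stand-in scoring function are all hypothetical; the slides note the program is pluggable, so any comparison algorithm could sit in place of `score`.

```python
# Hypothetical end-to-end check: seed the database with a sentinel sequence
# whose best-match score is known in advance, then verify the search
# returns exactly that entry with exactly that score.
KNOWN_SEQ = "ACDEFGHIKL" * 50   # invented 500-char sentinel
KNOWN_SCORE = 500               # expected score under the stand-in metric

def score(a, b):
    # Stand-in comparison: count of matching positions. The real,
    # pluggable algorithm would replace this.
    return sum(x == y for x, y in zip(a, b))

database = ["ACDEFGHIKX" * 50, KNOWN_SEQ, "LMNPQRSTVW" * 50]

best = max(database, key=lambda s: score(KNOWN_SEQ, s))
assert best == KNOWN_SEQ
assert score(KNOWN_SEQ, best) == KNOWN_SCORE
```

Because the search is exhaustive, the sentinel must always come back as the top hit; any other result signals corrupted distribution or a broken comparison.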
Lessons Learned
- Reading and writing single elements to an SQL database can be very expensive
- Even the best designs aren’t perfect, especially when the problem is not fully understood
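The first lesson is easy to demonstrate: issuing one statement per element is far slower than batching. A small SQLite illustration (SQLite is a stand-in here; the slides do not name the database engine):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER, payload TEXT)")
rows = [(i, "x" * 100) for i in range(10_000)]

# One statement per element -- the expensive pattern the slides warn about.
t0 = time.perf_counter()
for row in rows:
    conn.execute("INSERT INTO jobs VALUES (?, ?)", row)
conn.commit()
per_row = time.perf_counter() - t0

conn.execute("DELETE FROM jobs")

# Batched insert: one executemany call, one commit.
t0 = time.perf_counter()
conn.executemany("INSERT INTO jobs VALUES (?, ?)", rows)
conn.commit()
batched = time.perf_counter() - t0

count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(f"per-row: {per_row:.3f}s, batched: {batched:.3f}s, rows: {count}")
```

The same principle applies to reads: fetching result sets in batches avoids paying per-statement overhead once per element.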
Notes
- The biggest problem was distribution of the data
- Distribution was very costly, so we tried to reuse data that had already been distributed
- The program is pluggable, so any comparison algorithm can be used
Questions/Comments