Intelligent Bayesian Network-Based Approaches for Web Proxy Caching
Prepared by: Waleed Ali Ahmed & Siti Mariyam Shamsuddin
Soft Computing Research Group, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Johor, Malaysia
firstname.lastname@example.org, email@example.com
Outline
- Introduction
- Related Works
- The Proposed Intelligent Web Proxy Caching Approaches
- Implementation and Performance Evaluation
- Conclusion and Future Works
Background
Web caching is a well-known and highly successful strategy for improving the performance of Web-based systems: Web objects that are likely to be used in the near future are kept in a location closer to the user.
Why?
- To decrease latency
- To reduce Web server load
- To reduce bandwidth usage
Web proxy caching
In Web proxy caching, popular Web objects that are likely to be revisited in the near future are stored on the proxy server, which plays a key role between users and Web sites in reducing the response time of user requests and saving network bandwidth.
Why Web proxy caching?
- Proxy servers play a key role between users and Web sites, reducing response time and saving network bandwidth.
- Proxy caching is the most common caching strategy.
- Proxy caching is widely used by computer network administrators, technology providers, and businesses to reduce user delays and to alleviate Internet congestion (Kaya et al., 2009; Kumar, 2009; Kumar and Norris, 2008).
Problem Statement
Since the space apportioned to the cache is limited, it must be used judiciously (Romano and ElAarag, 2011). The most common Web caching methods are not efficient enough and may suffer from the cache pollution problem (Cobb and ElAarag, 2008; Koskela et al., 2003), which leads to:
- Reduction of the effective cache size
- Low hit ratio
- Wasted bandwidth
- Overload on the origin server
So far, determining which Web objects will be revisited remains a major challenge.
Intelligent Web Proxy Caching
Motivations for using machine learning in Web caching:
- Web access logs and trace files are readily available, and this history of accesses can be regarded as complete prior knowledge of future accesses.
- An efficient and adaptive scheme is needed, since the Web environment changes rapidly and continuously.
- Recent studies have proposed using artificial neural networks (ANN) in Web proxy caching, although ANN training may take a long time and require extra computational overhead.
- More significantly, the integration of intelligent techniques into Web cache replacement is still under research.
The suggested solutions
We present new intelligent approaches that rely on the capability of a Bayesian network (BN) to learn from Web proxy log files and to predict whether or not an object will be revisited. More significantly, the trained BN classifier is incorporated into traditional Web proxy caching algorithms to yield novel intelligent Web proxy caching approaches.
Bayesian Network (BN)
A Bayesian network is one of the most popular machine learning models; it relies on probability estimates to find the class of an observed pattern.
Rationale:
- A Bayesian network (BN) is defined as a directed acyclic graph over which a probability distribution is defined. Each node in the graph represents a random variable or event, while the arcs (edges) between the nodes represent associations or causal relationships.
Bayesian Network (BN)
The probabilistic dependency is maintained by the conditional probability table (CPT), which is attached to the corresponding event.
In classification tasks:
- The classification decision is calculated using Bayes' rule, $P(c \mid x) = P(x \mid c)\,P(c) / P(x)$, where $P(x \mid c)$ is the probability of finding the pattern $x$ in class $c$ and $P(c)$ is the probability of class $c$.
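As a rough illustration of this classification rule, the sketch below scores each class by P(x|c)P(c), normalizes by P(x), and picks the most probable class. It assumes the features are conditionally independent given the class (a naive Bayes structure, the simplest kind of BN), and every probability in it is a made-up number for illustration, not a value from this study.

```python
from math import prod

# Hypothetical priors P(c) for the two classes (illustrative numbers only).
priors = {"revisited": 0.3, "not_revisited": 0.7}

# Hypothetical conditional probability tables P(feature = value | c).
cpt = {
    "revisited": {
        "frequency": {"low": 0.2, "high": 0.8},
        "size": {"small": 0.6, "large": 0.4},
    },
    "not_revisited": {
        "frequency": {"low": 0.9, "high": 0.1},
        "size": {"small": 0.5, "large": 0.5},
    },
}

def posterior(x):
    """Return P(c | x) for every class c using Bayes' rule."""
    joint = {
        c: priors[c] * prod(cpt[c][feature][value] for feature, value in x.items())
        for c in priors
    }
    evidence = sum(joint.values())  # P(x)
    return {c: p / evidence for c, p in joint.items()}

def classify(x):
    """Return the class with the highest posterior probability."""
    post = posterior(x)
    return max(post, key=post.get)
```

A caching policy can use either the predicted class (as a yes/no decision) or the posterior probability itself (as a score).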
Why BN in Web Caching?
Bayesian networks are supervised learning models that are widely used in the medical field and in other areas such as military applications, forecasting, control, modelling for human understanding, cognitive science, statistics, and philosophy. Hence, Bayesian networks can be used to produce promising solutions for Web proxy caching.
Intelligent Web Caching?
The conventional Web caching methods are not efficient enough (Cobb and ElAarag, 2008; Koskela et al., 2003). Therefore, several researchers have proposed incorporating intelligent solutions to cope with the Web caching problem. According to Chen (2008), intelligent approaches are more efficient and more adaptive to the Web caching environment than other approaches.
Summary of intelligent web caching
From the previous studies, we can observe two approaches to intelligent Web caching:
- An intelligent technique is employed in Web caching on its own.
- An intelligent technique is employed together with the LRU algorithm.
Both approaches may predict the Web objects that can be re-accessed. However:
- They do not take into account the cost and size of the predicted objects in the cache replacement decision.
- Some important features are ignored.
- The training process requires a long time and extra computational overhead.
Proposed Approach vs. Existing Approaches

| Proposed approach | Existing approaches |
|---|---|
| Takes into consideration the most effective factors in the cache replacement decision | One or more factors are ignored in the cache replacement decision |
| Depends on BN, which can achieve much better accuracy and faster training than BPNN and ANFIS | Depend on ANN or ANFIS, whose training may take a long time and require extra computational overhead |
| Integrates the BN classifier into the GDS algorithm, which takes the cost and size of cached objects into consideration | --- |
| BN is effectively integrated with LRU | An intelligent technique is employed on its own or integrated with LRU |
The Proposed Intelligent Web Proxy Caching Approaches
The operational framework for the proposed approach
A Framework for the proposed approach
The framework consists of two functional components:
- Offline component: it works only while the proxy server is idle, and it is responsible for training the BN classifier.
- Online component: the intelligent caching strategies are executed in this part.
Online Component
In the online component, the intelligent caching strategies manage the proxy cache content. We propose intelligent Web proxy caching approaches that integrate BN with traditional Web caching to provide more effective caching policies:
- Bayesian Network-Greedy-Dual-Size approach (BN-GDS): the BN classifier is integrated with GDS to improve the byte hit ratio of GDS.
- Bayesian Network-Least-Recently-Used approach (BN-LRU): the BN classifier is combined with LRU to form a new algorithm called BN-LRU.
- Bayesian Network-Dynamic Aging approach (BN-DA): the BN classifier is combined with dynamic aging (DA) to form a new algorithm called BN-DA.
1- The intelligent BN-GDS approach
The Greedy-Dual-Size (GDS) caching algorithm was proposed by Cao and Irani (1997). The algorithm assigns a key value K(p) to each object p in the cache, so that the object with the lowest key value is replaced:
$K(p) = L + C(p) / S(p)$
where C(p) is the cost of bringing object p into the cache, S(p) is the object size, and L is an inflation factor that starts at 0 and is updated to the key value of the last replaced object. If an object is accessed again, its key value is updated using the new L value.
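The GDS replacement rule can be sketched as follows. This is a minimal illustration rather than the simulator used in the study: the class name `GDSCache`, the default cost of 1 per object, and the linear scan for the lowest key (a priority queue would be used in practice) are all assumptions made for brevity.

```python
class GDSCache:
    """Greedy-Dual-Size cache sketch; cost defaults to 1 per object."""

    def __init__(self, capacity):
        self.capacity = capacity  # total bytes available
        self.used = 0             # bytes currently occupied
        self.L = 0.0              # inflation factor, starts at 0
        self.size = {}            # object id -> size in bytes
        self.key = {}             # object id -> key value K(p)

    def access(self, obj, size, cost=1.0):
        """Process one request and return True on a cache hit."""
        if obj in self.key:
            # Hit: refresh the key value with the current L.
            self.key[obj] = self.L + cost / size
            return True
        if size > self.capacity:
            return False  # the object can never fit, so it is not cached
        while self.used + size > self.capacity:
            # Evict the object with the lowest key and set L to that key.
            victim = min(self.key, key=self.key.get)
            self.L = self.key.pop(victim)
            self.used -= self.size.pop(victim)
        self.size[obj] = size
        self.key[obj] = self.L + cost / size
        self.used += size
        return False
```

With a constant cost, small objects get higher key values and survive longer, which is why GDS tends to favor hit ratio over byte hit ratio.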
1- The intelligent BN-GDS approach
Cherkasova (1998) enhanced the GDS algorithm by incorporating a frequency count, yielding the Greedy-Dual-Size-Frequency (GDSF) algorithm:
$K(p) = L + F(p) \cdot C(p) / S(p)$
where F(p) is the access count of object p. One advantage of the GDSF policy is that it performs well in terms of hit ratio. However, its byte hit ratio is too low. Therefore, the BN classifier is integrated with GDS to improve the byte hit ratio; the result is called BN-GDS.
1- The intelligent BN-GDS approach
In the proposed BN-GDS, GDS is enhanced by incorporating into the key value the accumulated scores (probabilities) of revisiting object g, as predicted by the BN classifier. This means that the key value of object g is determined not just by its past occurrence frequency, but also by the class predicted from the six factors. The rationale behind BN-GDS is that we can raise the priority of cached objects that, according to the BN classifier, may be revisited in the near future, even if they are not accessed frequently enough.
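The transcript does not preserve the BN-GDS key equation, so the sketch below assumes one plausible form: K(g) = L + W(g) * C(g) / S(g), where W(g) is the running sum of the revisit probabilities that the classifier returns each time g is requested. The class name `BNGDSCache`, the `predict_revisit` callback, and the choice to discard W(g) on eviction are all hypothetical.

```python
class BNGDSCache:
    """GDS variant whose key is weighted by accumulated classifier scores (assumed form)."""

    def __init__(self, capacity, predict_revisit):
        self.capacity = capacity                # total bytes available
        self.predict_revisit = predict_revisit  # features -> probability of revisit
        self.used = 0
        self.L = 0.0                            # inflation factor
        self.size = {}                          # object id -> size in bytes
        self.score = {}                         # object id -> accumulated score W(g)
        self.key = {}                           # object id -> key value K(g)

    def access(self, obj, size, features, cost=1.0):
        """Process one request and return True on a cache hit."""
        hit = obj in self.key
        if not hit:
            if size > self.capacity:
                return False
            while self.used + size > self.capacity:
                victim = min(self.key, key=self.key.get)
                self.L = self.key.pop(victim)      # L <- key of the evicted object
                self.used -= self.size.pop(victim)
                self.score.pop(victim)             # assumption: scores reset on eviction
            self.size[obj] = size
            self.used += size
        # Accumulate the predicted probability and recompute the key.
        self.score[obj] = self.score.get(obj, 0.0) + self.predict_revisit(features)
        self.key[obj] = self.L + self.score[obj] * cost / size
        return hit
```

Under this assumed form, two objects of equal size and cost are no longer tied: the one the classifier considers more likely to be revisited is kept longer.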
2- The intelligent BN-LRU approach
The LRU policy is the most common proxy caching policy. However, it suffers from cold cache pollution: in LRU, a new object is inserted at the top of the cache stack, and if the object is never requested again, it takes some time to move down to the bottom of the stack before it is removed. To reduce cache pollution in LRU, the BN classifier is combined with LRU to form a new algorithm called BN-LRU.
2- The intelligent BN-LRU approach
The proposed BN-LRU works as follows:
- When a user requests Web object g, the BN classifier predicts whether or not that object will be revisited.
- If the BN classifies g as an object that will be revisited, g is placed at the top of the cache stack.
- Otherwise, g is placed in the middle of the cache stack.
Hence, BN-LRU can efficiently remove unwanted objects early to make space for new Web objects.
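The placement rule described above can be sketched with a plain list as the cache stack. The class name `BNLRUCache` and the `will_revisit` callback are hypothetical, and capacity is counted in objects rather than bytes to keep the example short.

```python
class BNLRUCache:
    """LRU stack where predicted one-timers enter at the middle instead of the top."""

    def __init__(self, capacity, will_revisit):
        self.capacity = capacity          # maximum number of objects (simplification)
        self.will_revisit = will_revisit  # features -> True if a revisit is predicted
        self.stack = []                   # index 0 is the top; the last item is evicted first

    def access(self, obj, features):
        """Process one request and return True on a cache hit."""
        hit = obj in self.stack
        if hit:
            self.stack.remove(obj)        # the object is re-placed below
        elif len(self.stack) >= self.capacity:
            self.stack.pop()              # evict from the bottom of the stack
        # Top of the stack for predicted revisits, middle otherwise.
        position = 0 if self.will_revisit(features) else len(self.stack) // 2
        self.stack.insert(position, obj)
        return hit
```

An object inserted at the middle needs only half as many subsequent insertions to reach the bottom, so a wrongly cached one-timer is evicted sooner than under plain LRU.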
3- The intelligent BN-DA approach
In addition to frequency, several factors can contribute to predicting whether an object will be revisited in the future. The proposed BN-DA approach combines the most significant factors through a Bayesian network (BN) classifier that predicts the probability that a Web object will be revisited later. In BN-DA, when a user visits Web object g, the trained BN classifier predicts the probability that g belongs to the class of objects that may be revisited. The probabilities of g are then accumulated as scores that are used in the cache replacement decision.
1-Data collection
We obtained the proxy log files of Web objects requested at several proxy servers of the IRCache network, located around the United States, for fifteen days (NLANR, 2010). In this study, the proxy log files of 21 August 2010 were used in the training phase, while the proxy log files of the following days were used in the simulation and implementation phase.

| Proxy dataset | Proxy server name | Location | Duration of collection |
|---|---|---|---|
| BO2 | bo.us.ircache.net | Boulder, Colorado | 21/8 – 4/9/2010 |
| SV | sv.us.ircache.net | Silicon Valley, California (FIX-West) | 21/8 – 4/9/2010 |
| SD | sd.us.ircache.net | San Diego, California | 21/8 – 28/8/2010 |
| NY | ny.us.ircache.net | New York, NY | 21/8 – 4/9/2010 |
2-Data Pre-processing
Data pre-processing removes irrelevant requests from the log files, since some log entries are invalid or irrelevant. The trace preparation is carried out as follows:
- Parsing: identifying the boundaries between successive fields and records in the log files.
- Filtering: eliminating irrelevant entries, such as uncacheable requests and entries with unsuccessful HTTP status codes.
- Finalizing: removing unnecessary fields. Moreover, each unique URL is converted to a unique integer identifier to reduce simulation time.
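The three steps above can be sketched in a few lines. The sketch assumes the logs follow the Squid native access-log layout (timestamp, elapsed time, client, action/status, size, method, URL, ident, hierarchy, content type). The concrete filtering rules (keep only GET requests with status 200, drop URLs containing "?" or "cgi-bin") are common heuristics chosen for illustration, not necessarily the rules used in this study, and `preprocess` is a hypothetical helper name.

```python
def preprocess(lines):
    """Parse, filter, and finalize raw log lines into compact trace records."""
    url_ids = {}   # URL -> integer identifier
    records = []
    for line in lines:
        # Parsing: fields are separated by whitespace.
        fields = line.split()
        if len(fields) < 10:
            continue  # malformed entry
        timestamp, elapsed, _client, action, size, method, url = fields[:7]
        mime = fields[9]
        status = action.split("/")[-1]

        # Filtering (assumed rules): unsuccessful or uncacheable requests are dropped.
        if method != "GET" or status != "200":
            continue
        if "?" in url or "cgi-bin" in url:
            continue

        # Finalizing: keep only the needed fields and replace the URL by an integer ID.
        url_id = url_ids.setdefault(url, len(url_ids) + 1)
        records.append((url_id, float(timestamp), int(elapsed), int(size), mime))
    return records
```

Replacing each URL with a small integer keeps the trace compact and makes dictionary lookups in the simulator cheaper than comparing long strings.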
2-Data Pre-processing
The final format of our data consists of the URL ID, timestamp, elapsed time, size, and type of the Web object.

| URL_ID | Timestamp | Elapsed Time (milliseconds) | Size (bytes) | Type |
|---|---|---|---|---|
| 1 | 1282348905.73 | 333 | 3070 | application/octet-stream |
| 2 | 1282348907.41 | 703 | 14179 | image/jpeg |
| 3 | 1282348908.47 | 284 | 1276 | image/jpeg |
| 4 | 1282349578.75 | 154 | 24612 | text/html |
| 1 | 1282349661.61 | 313 | 3070 | application/octet-stream |
| 5 | 1282349675.35 | 203 | 5592 | text/html |
| 6 | 1282349688.90 | 231 | 34796 | text/html |
| 4 | 1282349753.72 | 375 | 24612 | text/html |
| 4 | 1282350464.01 | 133 | 24612 | text/html |
| 1 | 1282351887.76 | 1353 | 3070 | application/octet-stream |
| 4 | 1282352609.09 | 55 | 24612 | text/html |
| 1 | 1282352861.56 | 1113 | 3070 | application/octet-stream |
3-Training Phase The training pattern takes the format:
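3-Training Phase
Preparation of the dataset for Web object classification. The first six columns are the inputs; the last column is the target.

| Recency | Frequency | SWL Frequency | Retrieval Time | Size | Type | Target |
|---|---|---|---|---|---|---|
| 1800 | 1 | 1 | 333 | 3070 | 5 | 1 |
| 1800 | 1 | 1 | 703 | 14179 | 2 | 0 |
| 1800 | 1 | 1 | 284 | 1276 | 2 | 0 |
| 1800 | 1 | 1 | 154 | 24612 | 1 | 1 |
| 1800 | 2 | 2 | 313 | 3070 | 5 | 0 |
| 1800 | 1 | 1 | 203 | 5592 | 1 | 0 |
| 1800 | 1 | 1 | 231 | 34796 | 1 | 0 |
| 1800 | 2 | 2 | 375 | 24612 | 1 | 1 |
| 1800 | 3 | 3 | 133 | 24612 | 1 | 0 |
| 2226.15 | 3 | 1 | 1353 | 3070 | 5 | 1 |
| 2145.08 | 4 | 1 | 55 | 24612 | 1 | 0 |
| 1800 | 4 | 2 | 1113 | 3070 | 5 | 0 |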
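The slides do not spell out how the training patterns are derived from the pre-processed trace, so the sketch below uses rules inferred from the example rows; they are assumptions, not the authors' stated definitions. With a sliding window length (SWL) of 1800 seconds: recency is the larger of SWL and the time since the object's previous request (SWL for a first request); SWL frequency grows by one when the previous request fell within the window and resets to 1 otherwise; the target is 1 when the object is requested again within SWL seconds. The type codes (text 1, image 2, application 5) are likewise read off the example rows.

```python
SWL = 1800.0  # sliding window length in seconds (assumed from the example rows)

# Numeric codes for content types, inferred from the example rows.
TYPE_CODE = {"text": 1, "image": 2, "application": 5}

EXAMPLE_TRACE = [
    (1, 1282348905.73, 333, 3070, "application/octet-stream"),
    (2, 1282348907.41, 703, 14179, "image/jpeg"),
    (3, 1282348908.47, 284, 1276, "image/jpeg"),
    (4, 1282349578.75, 154, 24612, "text/html"),
    (1, 1282349661.61, 313, 3070, "application/octet-stream"),
    (5, 1282349675.35, 203, 5592, "text/html"),
    (6, 1282349688.90, 231, 34796, "text/html"),
    (4, 1282349753.72, 375, 24612, "text/html"),
    (4, 1282350464.01, 133, 24612, "text/html"),
    (1, 1282351887.76, 1353, 3070, "application/octet-stream"),
    (4, 1282352609.09, 55, 24612, "text/html"),
    (1, 1282352861.56, 1113, 3070, "application/octet-stream"),
]

def build_patterns(records, swl=SWL):
    """Turn time-ordered (url_id, timestamp, elapsed, size, mime) records into training rows."""
    last_seen, freq, swl_freq = {}, {}, {}
    latest_row = {}  # url_id -> index of that object's most recent row
    rows = []
    for url_id, ts, elapsed, size, mime in records:
        if url_id in last_seen:
            delta = ts - last_seen[url_id]
            recency = max(swl, delta)
            if delta <= swl:
                swl_freq[url_id] += 1
                # The previous request was followed by a revisit inside the window.
                rows[latest_row[url_id]][-1] = 1
            else:
                swl_freq[url_id] = 1
        else:
            recency = swl
            swl_freq[url_id] = 1
        freq[url_id] = freq.get(url_id, 0) + 1
        last_seen[url_id] = ts
        latest_row[url_id] = len(rows)
        type_code = TYPE_CODE[mime.split("/")[0]]
        rows.append([round(recency, 2), freq[url_id], swl_freq[url_id],
                     elapsed, size, type_code, 0])
    return rows
```

Applied to `EXAMPLE_TRACE`, these rules reproduce the twelve rows of the training table, which is the only evidence offered here that the inferred definitions are reasonable.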
Each proxy dataset is then divided randomly into training data (70%) and testing data (30%). Subsequently, the dataset is discretized using the MDL method suggested by Fayyad & Irani (1993) with the default setup in WEKA. Finally, the Bayesian network (BN) is also trained using WEKA, where the BN algorithm is available in the Java class weka.classifiers.bayes.BayesNet. WEKA's default parameter values and settings are used for BN training.
4-Performance Evaluation
We have modified the WebTraff simulator (Markatchev and Williamson, 2002) to support our proposed proxy caching approaches. The trained classifiers are integrated with the WebTraff simulator to simulate the proposed intelligent Web proxy caching approaches.
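4-Performance Evaluation
Two common measures are used to analyze caching efficiency:
- Hit Ratio (HR): the percentage of requests that are served from the cache.
- Byte Hit Ratio (BHR): the percentage of requested bytes that are served from the cache.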
4-Performance Evaluation
Analysis of IRCache traces

| | BO2 | NY | UC | SV | SD |
|---|---|---|---|---|---|
| #Total requests | 1210693 | 3248452 | 8891764 | 2496001 | 29871204 |
| #Cacheable requests | 594989 | 1518232 | 2827904 | 1194098 | 6059349 |
| #Cacheable bytes | 23204930341 | 68402036319 | 46936258408 | 348043794224 | 230326816876 |
| #Unique requests | 530192 | 1144885 | 2402406 | 1012355 | 5284441 |
| Total size of unique requests (bytes) | 18690093450 | 56147903761 | 15653817175 | 238364029432 | 190539902251 |
| #Hits | 64797 | 373347 | 425498 | 181743 | 774908 |
| #Byte hits | 4514836891 | 12254132558 | 31282441233 | 19679764792 | 39786914625 |
| Max HR (%) | 10.89 | 24.59 | 15.05 | 15.22 | 12.79 |
| Max BHR (%) | 19.46 | 17.91 | 66.65 | 20.15 | 17.27 |
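The maximum HR and BHR of a trace are what an infinitely large cache would achieve: the first request for each object is a miss and every later request is a hit. The sketch below computes both under that assumption; `infinite_cache_ratios` is a hypothetical helper name, and objects are assumed never to change or expire.

```python
def infinite_cache_ratios(trace):
    """Return (HR %, BHR %) for an unbounded cache.

    trace: non-empty iterable of (object_id, size_in_bytes) pairs in request order.
    """
    seen = set()
    requests = hits = total_bytes = hit_bytes = 0
    for obj, size in trace:
        requests += 1
        total_bytes += size
        if obj in seen:
            hits += 1          # repeated request: served from the cache
            hit_bytes += size
        else:
            seen.add(obj)      # first request: a compulsory miss
    return 100.0 * hits / requests, 100.0 * hit_bytes / total_bytes
```

The same ratios can be checked against the BO2 figures: 64797 hits out of 594989 cacheable requests gives about 10.89%, and 4514836891 byte hits out of 23204930341 cacheable bytes gives about 19.46%.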
4-Performance Evaluation
Figure: Impact of cache size on HR for different proxy datasets: (a) BO2, (b) NY.
4-Performance Evaluation
In terms of Hit Ratio (HR):
- BN-GDS achieves the best HR among all algorithms, while LRU achieves the worst.
- BN-GDS and BN-LRU improve the HR of GDS and LRU, respectively.
- Although the HR of BN-DA is worse than that of GDS and GDSF, it is better than that of NNPCR-2, BN-LRU, and LRU.
4-Performance Evaluation
Figure: Impact of cache size on BHR for different proxy datasets: (a) BO2, (b) NY.
4-Performance Evaluation
In terms of Byte Hit Ratio (BHR):
- BN-LRU and BN-DA achieve the best BHR among all algorithms, while GDS and GDSF attain the worst.
- The BHR of LRU is better than that of BN-GDS, GDS, and GDSF.
- BN-GDS significantly improves on the BHR of GDS and GDSF.
- BN-LRU and BN-DA have a better BHR than LRU and NNPCR-2.
Conclusion
This study has proposed three intelligent Web proxy caching approaches, called BN-GDS, BN-LRU, and BN-DA, for improving the performance of conventional Web proxy caching algorithms. The BN classifier learns from Web proxy log files to predict whether or not objects will be revisited. The trained classifier is integrated with conventional Web proxy caching to provide more effective proxy caching policies. The simulation results revealed that BN-GDS achieved the best HR, a better BHR than GDS and GDSF, and an acceptable BHR compared with BN-LRU and BN-DA, which achieved the best BHR. This means that BN-GDS struck a better balance between HR and BHR than the other algorithms. On the other hand, BN-LRU and BN-DA achieved the best BHR among all algorithms, and a better HR than LRU and NNPCR-2.
Future works
In the future:
- Other intelligent classifiers can be used to improve the performance of traditional Web caching policies.
- Clustering algorithms can be used to enhance the performance of Web caching policies.
References
Kaya, C.C., Zhang, G., Tan, Y., & Mookerjee, V.S. 2009. An admission-control technique for delay reduction in proxy caching. Decision Support Systems, 46, 594-603.
Kumar, C. 2009. Performance evaluation for implementations of a network of proxy caches. Decision Support Systems, 46, 492-500.
Kumar, C., & Norris, J.B. 2008. A new approach for a proxy-level web caching mechanism. Decision Support Systems, 46, 52-60.
Romano, S., & ElAarag, H. 2011. A neural network proxy cache replacement strategy and its implementation in the Squid proxy server. Neural Computing & Applications, 20, 59-78.
Cobb, J., & ElAarag, H. 2008. Web proxy cache replacement scheme based on back-propagation neural network. Journal of Systems and Software, 81, 1539-1558.
Koskela, T., Heikkonen, J., & Kaski, K. 2003. Web cache optimization with nonlinear model using object features. Computer Networks, 43, 805-817.
Chen, H.T. 2008. Pre-fetching and re-fetching in Web caching systems: Algorithms and simulation. Trent University, Peterborough, Ontario, Canada.
Cao, P., & Irani, S. 1997. Cost-aware WWW proxy caching algorithms. In Proceedings of the 1997 USENIX Symposium on Internet Technology and Systems, Monterey, CA.
Cherkasova, L. 1998. Improving WWW proxies performance with Greedy-Dual-Size-Frequency caching policy. HP Technical Report, Palo Alto.
NLANR. 2010. National Lab of Applied Network Research (NLANR). Sanitized access logs. Available at http://www.ircache.net/
Fayyad, U.M., & Irani, K.B. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. 13th International Joint Conference on Artificial Intelligence (IJCAI-93), pp. 1022-1027.
Markatchev, N., & Williamson, C. 2002. WebTraff: A GUI for Web proxy cache workload modeling and analysis. Proceedings of the 10th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, p. 356.