Sizing your Alfresco platform Luis Cabaceira Technical Consultant.

Sizing your Alfresco platform Luis Cabaceira Technical Consultant

Agenda Who am I Presentation Objectives Key Sizing questions 2 Common use cases Architecture and strategies Sizing the different application layers

Who am I Luis Cabaceira, Technical Consultant at Alfresco 13 years of field experience Helped clients in various fields on architecture and sizing decisions Government & Intelligence Banking & Insurance Manufacturing Media & Publishing Government & Intelligence Banking & Insurance ManufacturingMedia & Publishing High Tech

Objectives of the presentation Explain sizing key questions that will drive sizing decisions Architecture best practices Explain the impact of system usage on the different application layers

Considerations The information provided on this presentation is based on Alfresco internal benchmarks as well as our field experience in various customer implementations. Sizing predictions should be validated by executing dedicated benchmarks. Monitoring your solution will also help on validating your sizing predictions.

Monitor your system

Sizing Key Information The best way to perform a sizing exercise is to ask the right questions and gather relevant information that can support the sizing decisions. Gathering this information should be the first step in order to do an Alfresco-sizing proposal.

Use case Gather all information related with how the solution is intended to be used by final end users. Identify the systems that the solution will be integrated with.

Concurrent Users Probably the most important factor while sizing and alfresco system. Number of Concurrent Users (including system users) Consider detail around average think- time for the number of concurrent users defined Details on average and peak values.

Documents Tipology What types of documents will be used ? What are the average document sizes ? Distribution ratio on the various types. Estimated Repository size: number of documents (nodes) Injection Rate: the repository growth or document upload rate

Architecture Architectural requirements and Alfresco components role in overall architecture. Is high availability needed? Failover ? Will the project be based on virtualization ? Stack components being used.

Authority structure Number of Users Number of user groups Group tree/structure depth Average number of users Per Group Average number of Groups per user

Operations Percentage of read and write operations. Split ratio between browse/download/search/Write Which types of search operations will be executed (global, full text, custom search). Time window frames for the operations execution. Averages and peaks

Components Protocols and Apis Which components of Alfresco will be used Which APIs will be used. Which protocols will be used ?

Batch processes Aproximate numbers and and types of batch jobs workflows scheduled processes transformations and renditions

Customizations and Integrations Details around planned customizations Customizations impact analisys Details on external system integrations. If there’s already developed code evaluation on the code performance and possible optimization of the same code.

Response Times The detail around response time requirements. May include requirements over the average operations response time, but also limits for each single operation and per operation type.

Sizing questions Considerations The most important topics are Concurrent Users Use Case Repository documents In many cases organizations are not able to provide the answers to the other questions.

The 2 most common use cases I can enumerate some generic common use cases but the details on which each real implementation differs may be very important from an architecture and sizing perspective. Generally Alfresco solutions can be classified on one of the 2 following cases: Collaboration Backend Repository

Collaboration x Backend Collaboration Scenario Search is usually just a small portion of the operations percentage (around 5%) User authority structure will be complex Most uploads are manually driven Backend Scenario in most of the cases especially for very large repositories there wont be full text indexing/search Authority structures will be in general fairly simple. Dedicated nodes for ingestion are normally used.

Collaboration x Backend Collaboration Scenario Repository Sizes are usually of small or intermediate size. Customization in most cases will concentrate at the front end (Share). Architecture options are in general the standard ones provided by Alfresco (cluster, dedicated index/transformation layers, etc). Backend Scenario Repository sizes are usually quite big. Custom solution code may live external to Alfresco by using CMIS, public APIs, etc. Architecture options are normally more complex using proxies, cluster and unclustered layers, sharding of Alfresco repository, etc

Collaboration x Backend Collaboration Scenario Concurrent users are normally quite high Share interface is used, very common usage of SPP, CIFS, IMAP, WebDAV and other public interfaces (CMIS) for other interfaces (mobile). Backend Scenario Concurrent users are in general fairly small but think times are much smaller than for collaboration. Most of the load should concentrate around public API (CMIS) and custom developed REST API (Webscripts).

Collaboration x Backend Collaboration Scenario Batch operations should mostly be around human interaction workflows and the standard Alfresco jobs. Backend Scenario Batch operations will usually have a considerable importance, including content injection processes (bulk or not), custom workflows and scheduled jobs.

There is no perfect formula There is no perfect formula that can be applied to determine the appropriate sizing for every scenario. Sizing and capacity planning depend of several important factors that change from project to project such as: Number and type of customizations External systems integrations User Behavior Network topology

Reference Architecture Study

7 Alfresco Layers For our sizing exercise we need to consider the important layers that compose our Alfresco architecture. Database Layer Repository Layer (User facing nodes) Repository Layer (Dedicated tracking nodes) Repository Layer (Bulk Ingestion Node) Search Layer (Solr Node(s)) Client Interface Layer (Alfresco Share) Content Store

Database Sizing All operations in Alfresco require a database connection, the database performance plays a crucial role on your Alfresco environment. It’s vital to have the database properly sized and tuned for your specific use case.

Database Sizing The following factors are relevant to calculate the approximate size for an Alfresco database. Number of meta data fields Permissions Number of folders Number of documents Number of versions Number of users/Groups

Database Size calculation To calculate de approximate size of the Alfresco database you can use the following formula. D = Number of Documents DA = Average number of Metadata fields per document F = Number of Folders FA = Average number of Metadata fields per Folder Dbsize = ((D * DA) + (F*FA)) * 5.5k

Database Tuning is Vital In Alfresco the default RDBMS configurations are normally not suitable for large repositories deployments and may result into: I/O bottlenecks in the RDBMS throughput Excessive queue for transactions due to overload of connections On active-active cluster configurations, excessive latency

Database Thread Pool A default Alfresco instance is configured to use up to a maximum of forty (40) database connections. Because all operations in Alfresco require a database connection, this places a hard upper limit on the amount of concurrent requests a single Alfresco instance can service (i.e. 40), from all protocols. I recommended to increase the maximum size of the database connection pool to at least [number of application server worker threads] + 75.

Database Scaling Alfresco relies largely on a fast and highly transactional interaction with the RDBMS, so the health of the underlying system is vital. If your project will have lots of concurrent users and operations, consider an active-active database cluster with at least 2 machines.

User facing nodes sizing The repository user-facing cluster has the nodes dedicated on serving the usual user requests (reads, manual writes and updates). The most important factor that stresses these servers is the number of concurrent users hitting the cluster. Note that these nodes also do transformations to swf when a user accesses the document preview. Uploads will also create thumbnail images.

User facing nodes sizing Operations: Percentage of browse, download and write operations Components, Protocols and API: Which components, protocols and APIs of the product will be used. Response Times: The detail around response time requirements.

User facing nodes Golden Rules Browsing the repository using a client (share or explorer) performs Acl checks on each item present on the folder being browsed. Acl checks go against the database and occupy a thread in tomcat, causing cpu consumption. Consider having a max of 1000 documents on each folder to reduce this overhead. Add a transformations timeout and configure the transformation limits to prevent threads spending to much time doing transformations.

User facing nodes Golden Rules User operations take application server threads. The more concurrency the more threads will be occupied and the more cpu will be used. Size the maximum number of application server threads (and the number of cluster nodes) according to your expected concurrency. Include the Alfresco scheduled tasks on your cpu calculations, those also require server threads and consume cpu.

User facing nodes Golden Rules Customizations on the repository should be carefully analyzed for memory and cpu consumption. Check for unclosed resultsets, they are a common memory leak. The local repository caches (L2 caches) occupy server memory, size them wisely. Check my blog post on the alfresco repository caches.http://blogs.alfresco.com/wp/lcabaceira/2014/ 05/29/repository-caches/http://blogs.alfresco.com/wp/lcabaceira/2014/ 05/29/repository-caches/

User facing nodes sizing Rec. By default Alfresco has only 1 thread configured for libreoffice, but the number of threads can be increased using the port numbers on jodConverter. If have lots of uploads and you need faster (concurrent) transformations, consider raising the number of libreoffice threads. Have this in consideration on your memory calculations for these servers, it takes typically 1GB for each Libreoffice thread.

User facing nodes Sizing Rec. Considering our field experience and our internal benchmark values we recommend adding an extra alfresco node for each 250 concurrent users. Number of Nodes = Number of Concurrent Users / 250. Number of Cpus Per Node = 3 quad-core or 2 hexacore (12 cores) Calculations are based on a (60/40) read/write ratio User think time considered (30s)

Dedicated tracking nodes sizing The dedicated tracking nodes exist on each Solr server and they are the only reference to Alfresco that Solr knows about. The dedicated tracking Alfresco instances perform. Text extraction of the document’s content to send to Solr. (Alfresco api call) Allow for local indexing (document’s do not have to travel over the wire during tracking)

Dedicated tracking nodes sizing Important facts Dedicated alfresco tracking nodes serve as a database and solr proxy, they are mostly used for FTS transformations. Memory is impacted by the number of documents and transactions on the documents.

Dedicated tracking nodes sizing Sizing Factors Operations: Percentage of write operations. Document Details: types, sizes and distribution ratios Ingestion Rate : (repository growth rate).

Dedicated tracking nodes sizing Normally these nodes do not have high requirements on memory and CPU but sizing will depend on the number of expecting uploads and transactions on the system. Indexing (Text extraction) impacts CPU and it’s API call that occur on these nodes.

Bulk ingestion nodes sizing Sizing the nodes dedicated to content ingestion is very similar to sizing the tracking nodes, these nodes do not take any user requests but they do generate document thumbnails. They are mainly database and solr proxys. They read xml files and there is few memory consumption. Ingestion of content is CPU intensive, The load on this server will mostly impact Solr indexing and the dedicated tracking instances (text extraction).

Bulk ingestion nodes tuning Disable unneeded services, many standard services are not required when bulk loading content Verify the existence of rules on target folder that can delay the speed of the ingestion Minimize network and file I/O operations Get source content as close to server storage as possible Tune JVM, Network, Threads, DB Connections.

Solr nodes sizing The solr nodes are the one’s that need more memory. Indexing impacts CPU usage, but the memory is the most important factor on these servers. Solr performance depends on the cache, bigger repositories require more memory (bigger caches) There is a public formula published by Alfresco to calculate the memory and CPU for your Solr nodes.

Solr nodes sizing factors Authority Structure Recent benchmarks comparisons that authority structure has a direct and important impact on performance of SOLR. When sizing Solr we should analyze the types of searches that will be executed and the authority structure of the corresponding use case

Solr nodes sizing factors Repository Size Important to notice that repository sizes should mostly only affect average response times for search operations, and more importantly for certain kinds of global searches very sensitive to the overall size of the repository. This is the layer most affected by the increase on repository size.

About Solr nodes memory To minimize memory requirements on solr : Reduce the cache sizes and check the cache hit rate. Disable ACL checks. Disable archive indexing, if you are not using it. Since everything scales to the number of documents in the index, add the Index control aspect to the documents you do not want in the index

Solr nodes tuning There are a lot of available tuning options for Solr. Those can have a huge impact on your performance. Check my blog post on Solr tuning tips. http://blogs.alfresco.com/wp/lcabaceira/2014/06/20/ solr-tuning-maximizing-your-solr-performance/

Content Store Sizing The performance of the storage where the content store resides is vital for the Alfresco overall performance. Number of documents (ND) Number of Versions (NV) Average document size (DS) Annual Growth rate Content Store Size = (ND * NV * DS)

Additional Info Alfresco Blog http://blogs.alfresco.com/wp/lcabaceira/ Github page https://github.com/lcabaceira

Thank you

Sizing your Alfresco platform Luis Cabaceira Technical Consultant.

Similar presentations

Presentation on theme: "Sizing your Alfresco platform Luis Cabaceira Technical Consultant."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Sizing your Alfresco platform Luis Cabaceira Technical Consultant.

Similar presentations

Presentation on theme: "Sizing your Alfresco platform Luis Cabaceira Technical Consultant."— Presentation transcript:

Similar presentations

About project

Feedback