Agenda: What is a Search Engine? Examples of popular Search Engines. Search Engines statistics. Why is Search Engine marketing important? What is a SEO Algorithm? Steps to developing a good SEO strategy. Ranking factors. Basic tips for optimization.
How Do Search Engines Work? A spider “crawls” the web to find new documents (web pages, other documents), typically by following hyperlinks from websites already in the search engine’s database. The search engine indexes the content (text, code) of these documents by adding it to its database, and periodically updates this content. When a user enters a query, the search engine searches its own database to find related documents (it does not search web pages in real time). The search engine ranks the resulting documents using an algorithm (mathematical formula) that assigns various weights to ranking factors.
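The index, search and rank steps above can be sketched on a toy scale. This is a minimal illustration, not how any real engine works: the function names `build_index` and `search`, the sample documents, and the scoring rule (sum of raw term counts) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a Counter of {doc_id: occurrences} (an inverted index)."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] += 1
    return index

def search(index, query):
    """Return ids of documents containing every query term, best score first."""
    terms = query.lower().split()
    if not terms:
        return []
    # Only documents that contain all terms are candidates.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))

    # Toy ranking (assumption): sum of raw term counts in the document.
    def score(doc_id):
        return sum(index[t][doc_id] for t in terms)

    return sorted(candidates, key=lambda d: (-score(d), d))

# Hypothetical documents standing in for crawled pages.
docs = {
    "a": "computer chess programs play chess",
    "b": "computer hardware reviews",
    "c": "chess openings for the computer age",
}
index = build_index(docs)
print(search(index, "computer chess"))
```

The query is answered entirely from the index built beforehand, which mirrors the point that pages are not searched in real time.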
Search on the Web. Corpus: the publicly accessible Web: static + dynamic. Goal: retrieve high quality results relevant to the user’s need (not docs!). Need: Informational – want to learn about something (e.g. Low hemoglobin). Navigational – want to go to that page (e.g. United Airlines). Transactional – want to do something (web-mediated): access a service (e.g. Tampere weather), downloads (e.g. Mars surface images), shop (e.g. Nikon CoolPix). Gray areas: find a good hub (e.g. Car rental Finland), exploratory search “see what’s there” (e.g. Abortion morality).
Search Engine Wars. The battle for domination of the web search space is heating up! The competition is good news for users! Crucial: advertising is combined with search results! What if one of the search engines manages to dominate the space?
Yahoo! Synonymous with the dot-com boom, probably the best known brand on the web. Started off as a web directory service in 1994, acquired leading search engine technology in 2003. Has very strong advertising and e-commerce partners
Lycos! One of the pioneers of the field Introduced innovations that inspired the creation of Google
Google Verb “google” has become synonymous with searching for information on the web. Has raised the bar on search quality Has been the most popular search engine in the last few years. Had a very successful IPO in August 2004. Is innovative and dynamic.
Live Search (was: MSN Search) Microsoft is synonymous with PC software. Remember its victory in the browser wars with Netscape. Developed its own search engine technology only recently, officially launched in Feb. 2005. May link web search into its next version of Windows.
Why is Search Engine marketing important? 80% of consumers find your website by first typing a query into a search engine (Google, Yahoo, Bing). 90% choose a site listed on the first page. 85% of all traffic on the internet is referred by search engines. The top three organic positions receive 59% of user clicks. Cost-effective advertising. Clear and measurable ROI. Operates under this assumption: More (relevant) traffic + Good Conversion Rate = More Sales/Leads
Experiment with query syntax. Default is AND, e.g. “computer chess” is normally interpreted as “computer AND chess”, i.e. both keywords must be present in all hits. “+chess” in a query means the user insists that “chess” be present in all hits. “computer OR chess” means at least one of the keywords must be present in each hit. “"computer chess"” (in quotation marks) means that the exact phrase “computer chess” must be present in all hits.
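To make the four query forms concrete, here is a small sketch of a matcher for a simplified version of this syntax. The function name `matches` and its exact rules are assumptions for illustration; real engines differ in details. Note one simplification: a literal lowercase “or” in the query is also treated as the OR operator.

```python
import re

def matches(query, text):
    """Return True if text satisfies a simplified search query."""
    words = text.lower().split()
    q = query.lower()
    # Quoted phrases must appear as consecutive words.
    for phrase in re.findall(r'"([^"]+)"', q):
        p = phrase.split()
        positions = range(len(words) - len(p) + 1)
        if not any(words[i:i + len(p)] == p for i in positions):
            return False
    q = re.sub(r'"[^"]+"', " ", q)
    # Split on OR: at least one alternative has to match.
    # Inside an alternative every keyword is required (default AND, or "+").
    alternatives = [alt.split() for alt in re.split(r"\bor\b", q)]
    alternatives = [alt for alt in alternatives if alt]
    if not alternatives:
        return True
    return any(all(t.lstrip("+") in words for t in alt) for alt in alternatives)

print(matches("computer chess", "chess on a computer"))
print(matches("computer OR chess", "chess board"))
print(matches('"computer chess"', "chess on a computer"))
```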
The most popular search keywords
AltaVista (1998) | AlltheWeb (2002) | Excite (2001)
sex | free
applet | sex
porno | download | pictures
mp3 | software | new
chat | uk | nude
Free Keyword Research Tools – https://adwords.google.com/o/Targeting/Explorer?__c=1000000000&__u=1000000000&__o=te&ideaRequestType=KEYWORD_IDEAS#search.none – Keyword Tool and Traffic Estimator to identify competitive phrases and search frequencies – http://www.google.com/insights/search – Compare search patterns across specific regions, categories, time frames and properties
Web search users. Ill-defined queries: short length; imprecise terms; sub-optimal syntax (80% of queries without operator); low effort in defining queries. Wide variance in: needs; expectations; knowledge; bandwidth. Specific behavior: 85% look over one result screen only (mostly above the fold); 78% of queries are not modified (1 query/session); follow links – “the scent of information”...
Q: How does a search engine know that all these pages contain the query terms? A: Because all of those pages have been crawled.
Crawling picture [Figure: the Web, showing seed pages, URLs crawled and parsed, the URLs frontier, and the unseen Web] Sec. 20.2
Motivation for crawlers. Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.). Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc. Business intelligence: keep track of potential competitors, partners. Monitor Web sites of interest. Evil: harvest emails for spamming, phishing… Can you think of some others?
A crawler within a search engine [Figure: Web, googlebot, Page repository, Text & link analysis, Text index, PageRank, Ranker, Query, hits]
One taxonomy of crawlers. Many other criteria could be used: Incremental, Interactive, Concurrent, etc.
Basic crawlers This is a sequential crawler Seeds can be any list of starting URLs Order of page visits is determined by frontier data structure Stop criterion can be anything
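A sequential crawler of this kind can be sketched in a few lines. This is an illustration under assumptions, not a production crawler: `crawl`, `LinkParser`, the `fetch` callback and the in-memory `PAGES` dictionary are hypothetical names, and the toy “web” replaces real HTTP requests so the sketch runs offline. The frontier here is a FIFO queue and the stop criterion is a page limit; both could be swapped.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, max_pages=100):
    """Sequential crawler; fetch(url) returns HTML text or None (hypothetical helper)."""
    frontier = deque(seeds)   # the frontier data structure decides visit order
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:   # stop criterion
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.com/b": "",
    "http://example.com/c": '<a href="/">home</a>',
}
print(crawl(["http://example.com/"], PAGES.get))
```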
Graph traversal (BFS or DFS?) Breadth First Search: implemented with a QUEUE (FIFO); finds pages along shortest paths; if we start with “good” pages, this keeps us close; maybe other good stuff… Depth First Search: implemented with a STACK (LIFO); wanders away (“lost in cyberspace”).
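The difference between the two strategies comes down to which end of the frontier the next page is taken from. The sketch below uses a hypothetical link graph and a hypothetical `traverse` function to show the resulting visit orders.

```python
from collections import deque

def traverse(graph, seed, strategy="bfs"):
    """Visit order on a link graph using a queue (BFS) or a stack (DFS)."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # FIFO pop gives breadth-first order, LIFO pop gives depth-first order.
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Hypothetical link graph: the seed links to a and b, each starting a chain.
graph = {
    "seed": ["a", "b"],
    "a": ["a1"],
    "a1": ["a2"],
    "b": ["b1"],
}
print(traverse(graph, "seed", "bfs"))  # stays close to the seed first
print(traverse(graph, "seed", "dfs"))  # follows one chain to its end first
```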
Universal crawlers Support universal search engines Large-scale Huge cost (network bandwidth) of crawl is amortized over many queries from users Incremental updates to existing index and other data repositories
Large-scale universal crawlers. Two major issues: 1. Performance: need to scale up to billions of pages. 2. Policy: need to trade off coverage, freshness, and bias (e.g. toward “important” pages).
Large-scale crawlers: scalability. Need to minimize overhead of DNS lookups. Need to optimize utilization of network bandwidth and disk throughput (I/O is the bottleneck). Use asynchronous sockets: multi-processing or multi-threading do not scale up to billions of pages; non-blocking: hundreds of network connections open simultaneously; polling sockets to monitor completion of network transfers.
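The non-blocking idea can be illustrated with Python’s `asyncio`. This is only a sketch: `fetch` is a stand-in that sleeps instead of opening a socket, and `crawl_all`, the URL list and the limit of 100 connections are assumptions for the example. The point is that many transfers are in flight at once inside a single thread.

```python
import asyncio
import time

async def fetch(url, delay=0.1):
    """Stand-in for a network request; the sleep simulates latency."""
    await asyncio.sleep(delay)
    return url, 200

async def crawl_all(urls, max_connections=100):
    """Fetch all urls with at most max_connections requests in flight."""
    limit = asyncio.Semaphore(max_connections)

    async def bounded(url):
        async with limit:
            return await fetch(url)

    # gather returns results in the same order as the input urls.
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"http://example.com/page{i}" for i in range(200)]
start = time.perf_counter()
results = asyncio.run(crawl_all(urls))
elapsed = time.perf_counter() - start
print(len(results), f"{elapsed:.2f}s")
```

Fetching the same 200 simulated pages one after another would take about 20 seconds; with 100 concurrent requests it takes roughly two rounds of the simulated latency.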
Universal crawlers: Policy. Coverage: new pages get added all the time. Can the crawler find every page? Freshness: pages change over time, get removed, etc. How frequently can a crawler revisit? Trade-off! Focus on most “important” pages (crawler bias)? “Importance” is subjective.
Web coverage by search engine crawlers. This assumes we know the size of the entire Web. Do we? Can you define “the size of the Web”?
Maintaining a “fresh” collection. Universal crawlers are never “done”. High variance in rate and amount of page changes. HTTP headers are notoriously unreliable: Last-modified, Expires. Solution: estimate the probability that a previously visited page has changed in the meanwhile, and prioritize by this probability estimate.
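One way to turn this into numbers is sketched below. The slide does not say which estimator to use; this example assumes page changes follow a Poisson process, so that the probability of at least one change after t days is 1 - exp(-rate * t), and it estimates the rate naively from past visits. The function names and the sample statistics are hypothetical.

```python
import math

def change_probability(changes_seen, visits, days_since_visit, days_between_visits):
    """Estimate P(page changed since last visit) under an assumed Poisson model."""
    # Naive rate estimate: observed changes per day of observation.
    rate = changes_seen / (visits * days_between_visits)
    return 1.0 - math.exp(-rate * days_since_visit)

def prioritize(pages):
    """Order pages so those most likely to have changed are revisited first."""
    return sorted(pages, key=lambda p: change_probability(**p["stats"]), reverse=True)

# Hypothetical crawl history for two pages.
pages = [
    {"url": "http://example.com/about",
     "stats": dict(changes_seen=1, visits=10, days_since_visit=7, days_between_visits=7)},
    {"url": "http://example.com/news",
     "stats": dict(changes_seen=9, visits=10, days_since_visit=1, days_between_visits=1)},
]
for page in prioritize(pages):
    print(page["url"], round(change_probability(**page["stats"]), 3))
```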
Do we need to crawl the entire Web? If we cover too much, it will get stale There is an abundance of pages in the Web For PageRank, pages with very low prestige are largely useless What is the goal? General search engines: pages with high prestige News portals: pages that change often Vertical portals: pages on some topic What are appropriate priority measures in these cases? Approximations?
Complications. Web crawling isn’t feasible with one machine: all of the above steps are distributed. Malicious pages: spam pages; spider traps, including dynamically generated ones. Even non-malicious pages pose challenges: latency/bandwidth to remote servers varies; webmasters’ stipulations (how “deep” should you crawl a site’s URL hierarchy?); site mirrors and duplicate pages. Politeness: don’t hit a server too often. Sec. 20.1.1
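The politeness rule can be sketched as a small per-host gate that a crawler consults before each request. The class name `PolitenessGate`, its methods and the 2-second delay are assumptions for illustration; times are passed in explicitly so the behavior is easy to follow.

```python
from urllib.parse import urlparse

class PolitenessGate:
    """Track the earliest time each host may be contacted again."""
    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay
        self.next_allowed = {}   # host -> earliest allowed fetch time

    def wait_time(self, url, now):
        """Seconds to wait before fetching url; 0.0 means fetch now."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Remember that the host was just hit."""
        host = urlparse(url).netloc
        self.next_allowed[host] = now + self.min_delay

gate = PolitenessGate(min_delay=2.0)
gate.record_fetch("http://example.com/a", now=10.0)
print(gate.wait_time("http://example.com/b", now=10.5))  # same host: must wait
print(gate.wait_time("http://other.org/", now=10.5))     # different host: no wait
```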
What is robots.txt? It’s a file in the root of your website that can either allow or restrict search engine robots from crawling pages on your website.
How does it work? Before a search engine robot crawls your website, it will first look for your robots.txt file to find out where you want it to go. There are 3 things you should keep in mind: 1. Robots can ignore your robots.txt. Malware robots scanning the web for security vulnerabilities, or email address harvesters used by spammers, will not care about your instructions. 2. The robots.txt file is public. Anyone can see what areas of your website you don’t want robots to see. 3. Search engines can still index (but not crawl) a page you’ve disallowed, if it’s linked to from another website. In the search results it’ll then only show the URL, but usually no title or information snippet. To keep such a page out of the index, use the robots meta tag on that page instead.
What to put in your robots.txt file. User-agent: This is the line where you define which robot you’re talking to. It’s like saying hello to the robot: User-agent: * (all robots; specific names include Googlebot for Google and Slurp for Yahoo). Disallow: This tells the robots what you don’t want them to crawl on your site: Disallow: / (do not crawl anything on my site), Disallow: /images/ (do not crawl anything under /images/). Allow: This tells the robots what you want them to crawl on your site: Allow: /
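From the crawler’s side, these directives can be checked with Python’s standard `urllib.robotparser` module. The robots.txt content, the crawler name `MyCrawler` and the URLs below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse text directly, no network needed

# can_fetch(useragent, url) answers: may this robot crawl this URL?
print(parser.can_fetch("MyCrawler", "http://example.com/index.html"))
print(parser.can_fetch("MyCrawler", "http://example.com/images/logo.png"))
```

In a real crawler the file would normally be loaded with `set_url(...)` followed by `read()` rather than parsed from a string.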
What to put in your robots.txt file. * (Asterisk / wildcard): With the * symbol, you tell the robots to match any number of any characters. Very useful, for example, when you don’t want your internal search result pages to be indexed. Disallow: *contact* (do not crawl any URLs containing the word contact). $ (Dollar sign / ends with): The dollar sign tells the robots that it is the end of the URL. Disallow: *.pdf$ (do not crawl any URLs ending in .pdf). # (Hash / comments): You can add comments after the “#” symbol, either at the start of a line or after a directive.
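The meaning of * and $ can be made precise by translating a pattern into a regular expression. This is a sketch with hypothetical helper names (`rule_to_regex`, `is_disallowed`), not an implementation of any particular search engine’s matcher. As far as I know, Python’s built-in `urllib.robotparser` only does prefix matching, which is why the translation is written by hand here.

```python
import re

def rule_to_regex(pattern):
    """Translate a robots.txt path pattern with * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal characters, then turn each * into "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path, patterns):
    """True if path matches any Disallow pattern (matched from the start of the path)."""
    return any(rule_to_regex(p).match(path) for p in patterns)

rules = ["*contact*", "*.pdf$"]
print(is_disallowed("/about/contact-us", rules))
print(is_disallowed("/files/report.pdf", rules))
print(is_disallowed("/files/report.pdf?download=1", rules))
```

The last path is not blocked because the $ requires the URL to end right after “.pdf”.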
What to put in your robots.txt file. Crawl-delay: This directive asks the robot to wait a certain number of seconds after each time it has crawled a page on your website. Crawl-delay: 5 Request-rate: Here you tell the robot how many pages you want it to crawl within a certain number of seconds. The first number is pages, and the second number is seconds. Request-rate: 1/5 # load 1 page per 5 seconds Visit-time: It’s like opening hours, i.e. when you want the robots to visit your website. This can be useful if you don’t want the robots to visit your website during busy hours (when you have lots of human visitors). Visit-time: 2100-0500 # only visit between 21:00 (9PM) and 05:00 (5AM) UTC (GMT)
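A crawler can read Crawl-delay and Request-rate with the standard `urllib.robotparser` module, as sketched below. The robots.txt text and the crawler name are made up. Visit-time is left out because, to my knowledge, that module does not expose it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Request-rate: 1/5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("MyCrawler")   # seconds between requests, or None
rate = parser.request_rate("MyCrawler")   # named tuple (requests, seconds), or None
print(delay, rate.requests, rate.seconds)
```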
Test your page https://www.google.com/webmasters/
What is SEO? SEO = Search Engine Optimization. Refers to the process of “optimizing” both the on-page and off-page ranking factors in order to achieve high search engine rankings for targeted search terms. Refers to the “industry” that has been created around using keyword searching as a means of increasing relevant traffic to a website.
What is a SEO Algorithm? Top Secret! Only select employees of a search engine company know for certain. Reverse engineering, research and experiments give SEOs (search engine optimization professionals) a “pretty good” idea of the major factors and approximate weight assignments. The SEO algorithm is constantly changed, tweaked & updated. Websites and documents being searched are also constantly changing. Varies by search engine: some give more weight to on-page factors, some to link popularity.
A good SEO strategy: Research desirable keywords and search phrases (WordTracker, Overture, Google AdWords). Identify search phrases to target (should be relevant to business/market, obtainable and profitable). “Clean” and optimize a website’s HTML code for appropriate keyword density, title tag optimization, internal linking structure, headings and subheadings, etc. Help in writing copy to appeal to both search engines and actual website visitors. Study competitors (competing websites) and search engines. Implement a quality link building campaign. Add quality content. Constant monitoring of rankings for targeted search terms.
Ranking factors. On-Page Factors (Code & Content): #1 - Content, Content, Content (Body text); #2 - Keyword frequency & density; #3 - Title tags; #4 - ALT image tags; #5 - Header tags; #6 - Hyperlink text. Off-Page Factors: #1 - Anchor text; #2 - Link Popularity (“votes” for your site) – adds credibility.
What a Search Engine Sees View > Source (HTML code)
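To get a feel for what can be read out of the HTML source, the sketch below pulls out a few on-page elements (title, headings, image alt text, link text) with Python’s standard `html.parser`. The class name `OnPageFactors` and the sample page are hypothetical, and the parser is deliberately simple (for instance, it does not handle links nested inside headings).

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")

class OnPageFactors(HTMLParser):
    """Collect title, headings, image alt text and link anchor text."""
    def __init__(self):
        super().__init__()
        self.factors = {"title": [], "headings": [], "alt": [], "anchor": []}
        self._current = None   # which text bucket we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = "title"
        elif tag in HEADINGS:
            self._current = "headings"
        elif tag == "a":
            self._current = "anchor"
        elif tag == "img" and attrs.get("alt"):
            self.factors["alt"].append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title" or tag == "a" or tag in HEADINGS:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.factors[self._current].append(data.strip())

# Hypothetical page source.
HTML = """<html><head><title>Car rental Finland</title></head>
<body><h1>Rent a car in Tampere</h1>
<img src="car.jpg" alt="compact rental car">
<p>Book online. <a href="/prices">See our prices</a></p></body></html>"""

page = OnPageFactors()
page.feed(HTML)
print(page.factors)
```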
Pay Per Click. PPC ads appear as “sponsored listings”. Companies bid on the price they are willing to pay “per click”. Typically have very good tracking tools and statistics. Ability to control ad text. Can set budgets and spending limits. Google AdWords and Overture are the two leaders.
PPC vs. “Organic” SEO
Pay-Per-Click: results in 1-2 days; easier for a novice or one with little knowledge of SEO; ability to turn on and off at any moment; generally more costly per visitor and per conversion; fewer impressions and exposure; easier to compete in highly competitive market space (but it will cost you); ability to generate exposure on related sites (AdSense); ability to target “local” markets; better for short-term and high-margin campaigns.
“Organic” SEO: results take 2 weeks to 4 months; requires ongoing learning and experience to achieve results; very difficult to control flow of traffic; generally more cost-effective, does not penalize for more traffic; SERPs are more popular than sponsored ads; very difficult to compete in highly competitive market space; ability to generate exposure on related websites and directories; more difficult to target local markets; better for long-term and lower margin campaigns.
Keys to Successful SEO Strategy 1. Do not underestimate the importance of keyword research 2. Be sure to include the proper tags in your page coding 3. You must have optimized content! (3-5 uses of keyword per 250 words) 4. Use content marketing
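The “3-5 uses of keyword per 250 words” guideline in point 3 is simple arithmetic and can be checked with a short script. The function names, the single-word keyword restriction and the word-splitting rule are assumptions made for this sketch.

```python
import re

def keyword_uses_per_250_words(text, keyword):
    """Occurrences of a single-word keyword, scaled to a 250-word passage."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    uses = words.count(keyword.lower())
    return uses * 250 / len(words)

def within_guideline(text, keyword, low=3, high=5):
    """True if the keyword appears 3-5 times per 250 words (the slide's guideline)."""
    return low <= keyword_uses_per_250_words(text, keyword) <= high

# Synthetic 250-word sample containing the keyword 4 times.
sample = " ".join(["chess"] * 4 + ["filler"] * 246)
print(keyword_uses_per_250_words(sample, "chess"), within_guideline(sample, "chess"))
```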
Keyword Selection. Marketing/Brand Relevance: How closely does the keyword match your product/service offering, messaging, goals and objectives? Search Frequency: How many people are searching on the particular keyword? Competition: How much competition (large, authority sites) is there for the particular keyword? Optimization Opportunity: Is there already a logical place on the site to optimize for the particular keyword?