Presentation is loading. Please wait.

Presentation is loading. Please wait.

Concordancing the Web with KWiCFinder William H. Fletcher United States Naval Academy American Association for Applied Corpus Linguistics Third North American.

Similar presentations

Presentation on theme: "Concordancing the Web with KWiCFinder William H. Fletcher United States Naval Academy American Association for Applied Corpus Linguistics Third North American."— Presentation transcript:

1 Concordancing the Web with KWiCFinder William H. Fletcher United States Naval Academy American Association for Applied Corpus Linguistics Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23-25 March 2001

2 How Big is the Web? Now 2-4 billion webpages accessible via public links (Cyberveillance estimates & projection July 2000; Inktomi estimates are more modest.) “Invisible web” / restricted sites several times larger Estimated 80%-95% content in English, but… Since mid 2000, non-Anglophones outnumber English speakers online Anglophones < 30% of 850 million users in 2005 Percentage of new users fluent in English decreasing For many regions / languages, still no data available

3 Search Purposes General users typically seek… a specific site any well-stocked site meeting their needs Scholarly searchers must examine and evaluate a range of sites to identify the most relevant and reliable resources Educators want to foster similar online research behavior in their students

4 Typical Search Behaviors Marked preference for directories with pre-selected links organized by topic over full-text search engines Simple queries – single word or phrase – predominate (80%-90%) 10%-25% of attempted complex queries (Boolean operators, bracketing) are ill-formed Users tend to work in a single window, calling up one document at a time, then returning to search engine for another link

5 Typical Search Outcomes Users follow up only first few links, then settle on a page after browsing from these Usual outcome is a match, not best match

6 Ways to Use the Web for Instruction and Research Micro level Discover eloquent examples Verify current / possible usage, with rough indication of prevalence Acquire vocabulary not (yet) in dictionaries Timeliness is essential -- “off-the-shelf corpora” often cannot help here! Enable students to develop discovery skills (Salzman/Mills “Grammar Safari”)

7 Ways to Use the Web for Instruction and Research (2) Macro level Find authentic texts accessible to students Locate relevant online resources for research projects Student reports Scholarly research

8 Impediments to Finding Relevant Resources Online Reliance on commercial search engines (SEs) essential due to Web’s size SEs’ priorities match ours only by coincidence Link rot Pages move or disappear Page content changes

9 Challenges to Responsible Research Online there is too much ephemeral content of unknown reliability Preponderance of journalistic, commercial and personal texts of unknown authorship and authority Details of sources and research methodology haphazard Even student papers (gasp) and machine translated texts (groan choke)

10 Challenges to Responsible Research (2) Representativity of Web as Corpus Much ill-formed or fragmentary language Domain only a rough clue to provenance Numbers vs. Statistics Search engines number of pages matching a query, not actual citations One page may contain alternate usages Narrower filters may eliminate some pages

11 Webidence as Evidence Our profession needs to develop “Standards of Webidence” to guide selection and documentation of online language for serious research purposes.

12 The Web is not a corpus in the classical sense… …but it does offer an inexhaustible body of linguistic and cultural information for research and use.

13 Why KWiCFinder? Automate process of search and retrieval Expedite evaluation of webpages Provide specific enhancements for foreign language users and linguists Encourage students and colleagues to take full advantage of online resources

14 Why AltaVista? All words are indexed, including "stopwords" Distinguishes case and "special characters" Supports Boolean operators, bracketing, and wildcards True world-wide coverage, with search by language No limits to length or complexity of the query Literal text search, without "second-guessing"

15 KWiCFinder Enhances AltaVista with… Intuitive input for foreign characters, bracketing, operators, dates Inclusion / exclusion criteria not included in KWiC report to focus search Automatic search and retrieval in the background returning KWiC abstracts

16 KWiCFinder Enhances AltaVista with… (2) Restricted wildcards ? % (1, 0-1 char) vs. AltaVista * (0-5 chars) “Sic” option so “plain” or lower-case char does not match “special” or upper- case variants: By SE default, a matches any of aáâäàãæåAÁÂÄÀÃÆÅ

17 KWiCFinder Enhances AltaVista with… (3) “Tamecards” -- User inputs pattern, KF generates variants: on-line matches on-line, on line, online s[iau]ng matches sing, sang, sung {me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an} matches only reflexive forms me despierto, te despiertas, se despierta, nos despertamos, os despertáis, se despiertan

18 How Does XML Enhance KWiCFinder? Search results become a dynamic database for end user to manipulate: categorize, annotate, delete, merge / split searches, citations and documents Free tools permit developer or end-user to restyle and add interactivity to reports Layouts Languages Data format

19 Why WebKWiC? Original hope: cross-platform, cross- browser solution Minimal entry threshold: small download of HTML pages + JavaScript Support for non-Western European languages

20 Why Google? Link popularity ranking puts relevant sites at or near top of list Straightforward approach to Advanced Search (“implicit Booleans”) easy to learn, thus most likely to be used by students independently Largest number of pages analyzed Matching pages always* available in cache with KWiC markup

21 How Does WebKWiC Complement Google? Focuses and enhances interface for language learners Provides tools to navigate among citations and documents Simplifies management of multiple windows

22 Future of Web Concordancing Agents will create specialized corpora on demand, by “search and crawl” or by monitoring specific sites Multiplicity of encoding formats (various HTMLs, XML…) and languages will place increasing demands on developers of KWiCFinder and analogues

23 Pleas(e) Visit Download and try KWiCFinder and WebKWiC View bibliography as well as this and related presentations Use these tools with your students Send feedback and suggestions to

Download ppt "Concordancing the Web with KWiCFinder William H. Fletcher United States Naval Academy American Association for Applied Corpus Linguistics Third North American."

Similar presentations

Ads by Google