Computational Linguistics WTLAB ( Web Technology Laboratory ) Mohsen Kamyar.


Computer Science
In recent years the main “data source” has been the World Wide Web and other sources of text data.
- Data is generated autonomously.
- We can’t force people to use a specific format for data.
- People want to present data in as few words as possible.
- We will see structures that are illegal in the language’s grammar, or that are not even words.
- We will see rapid language change, so we can’t use static models of language.

Computer Science (cont.)
In this view, the precision of language processing is computed from word frequencies (in Linguistics, by contrast, we deal with distinct words).
Some examples of such applications:
American governmental programs:
- Total Information Awareness (TIA), during 2003
- Computer-Assisted Passenger Prescreening System (CAPPS II), until 2004, which assigns a color to each passenger
- Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE), during 2004-2006, as a component of a program with a $47 million budget

Computer Science (cont.)
- Multistate Anti-Terrorism Information Exchange (MATRIX), until 2005
And many software vendors (based on 2008 reports):
- Angoss Software, Infor CRM Epiphany, Kxen, Portrait Software, SAS, SPSS, ThinkAnalytics, Unica, Viscovery, …
However, there are also applications that are closer to Linguistics:
- Machine Translation
- Human-Computer Interaction applications
- Text to Speech
- Text Simplification

Data Mining
From the “text data” point of view, data mining has three main steps:
1. Pre-processing: preparing a representation of the data that is suitable for the next steps.
2. Data mining: identifying relationships in the data from the following viewpoints:
  - Classification: arranging the data into predefined groups.
  - Clustering: arranging the data into groups that are not predefined and must be discovered.

Data Mining (cont.)
  - Regression: finding an equation that describes the data model.
  - Association Rule Learning: finding relations between concepts or main objects in the data model.
3. Interpreting the results.
The research areas shared by “Computer Science” and “Linguistics” in this process are most likely steps 1 and 3 (mainly step 1). The search engine is an example that highlights this.
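As a side note on step 2, the difference between classification and clustering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values, the group labels and the function names classify and cluster are hypothetical, and the clustering routine is a basic one-dimensional k-means chosen for brevity, not a method named in the slides.

```python
def classify(value, centroids):
    """Classification: assign a value to the nearest of the PREDEFINED groups.

    centroids maps a group label to a representative value (hypothetical data).
    """
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def cluster(values, k=2, iterations=10):
    """Clustering: discover k groups (k >= 2) with a basic 1-D k-means."""
    ordered = sorted(values)
    # Start from k evenly spaced points of the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Hypothetical document lengths, in words.
lengths = [10, 12, 11, 200, 210, 190]
label = classify(15, {"short": 11, "long": 200})
found_groups = cluster(lengths, k=2)
```

In classify the groups ("short", "long") are given in advance; in cluster only the number of groups is given and the groups themselves are found from the data.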

Search Engine
[Figure: search engine architecture. Components shown: Crawler, Web, Web Cache, URL Queue, WordNet, Indexes, Indexer, Stemmer, Ranking]

Search Engine (cont.)
The search engine is the most popular application, the most important example of data mining in use, a high-technology product, and so on.
In pre-processing, search engines have the following tasks that focus on linguistic aspects of the data:
- Computing the importance factor of a word in a document:
  - Frequency
  - TF-IDF (Vector Space Model)
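The importance factor can be sketched as follows. The slides do not say which TF-IDF variant is meant, so this assumes one common form, weight(w, d) = tf(w, d) * log(N / df(w)), where tf is the relative frequency of word w in document d, N is the number of documents and df(w) is the number of documents containing w. The function name tf_idf, the whitespace tokenizer and the sample sentences are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dictionary per document."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical two-document collection.
docs = ["the cat sat", "the dog barked"]
weights = tf_idf(docs)
```

A word that occurs in every document, such as "the" here, gets weight 0, while a word that is frequent in one document but rare in the collection gets a high weight. Each dictionary is one document vector in the Vector Space Model.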

Search Engine (cont.)
- Stemming: there are two main categories of approaches, dictionary-based and non-dictionary-based.
- Using the tag of a word in a sentence for stemming.
- Related words (resources such as WordNet):
  - Synonyms: same meaning
  - Hypernyms and hyponyms: general concepts and sub-concepts
  - Homonyms: same spelling but different meaning
  - Acronyms: abbreviations
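The two categories of stemming approaches can be contrasted in a minimal sketch. Everything here is an assumption made for illustration: LEMMA_DICTIONARY is a tiny hypothetical lookup table standing in for a real dictionary resource, and SUFFIXES is an ad hoc suffix list, not a published algorithm such as Porter's.

```python
# Hypothetical lookup table standing in for a real dictionary resource.
LEMMA_DICTIONARY = {"went": "go", "mice": "mouse", "better": "good"}

# Suffixes tried in this order (illustrative only).
SUFFIXES = ("ing", "ed", "es", "s")


def dictionary_stem(word):
    """Dictionary-based: look the word up; return None if it is unknown."""
    return LEMMA_DICTIONARY.get(word.lower())


def suffix_stem(word, min_stem_length=3):
    """Non-dictionary-based: strip the first matching suffix by rule."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Prefer the dictionary; fall back to suffix stripping."""
    return dictionary_stem(word) or suffix_stem(word)
```

For example, stem("mice") is resolved by the dictionary, while stem("walking") and stem("indexes") fall back to the suffix rules. The dictionary handles irregular forms but only covers the words it lists; the rules cover any word but can over-strip or under-strip.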

Semantic Search Engine
In a “semantic search engine” the main differences are as follows:
- Indexing is based not on words but on an “ontology”:
  - Ontology extraction
  - Latent Semantic Indexing
- Ranking is based not on “web links” but on “similarity between pages”.
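Ranking by similarity between pages can be sketched as below. The slides do not name a similarity measure, so this assumes cosine similarity over raw word counts, which is a common choice in the Vector Space Model; it does not implement ontology extraction or Latent Semantic Indexing. The function names and the sample page texts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(reference_page, pages):
    """Order pages from most to least similar to the reference page."""
    return sorted(
        pages,
        key=lambda page: cosine_similarity(reference_page, page),
        reverse=True,
    )


# Hypothetical page texts.
ranked = rank_pages(
    "web search engine",
    ["cooking recipes", "semantic search engine"],
)
```

Here the ranking depends only on the content of the pages, not on the link structure between them.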
