Supervisor: Mr. Phan Trường Lâm Supervisor:. Team information.


1 Supervisor: Mr. Phan Trường Lâm

2 Team information

3 Agenda: Introduction; Project plan; System Requirement Specifications; System Analysis and Design; Testing; Deployment and User Guide; Summary; Demo and Q&A

4 Introduction: Initial Idea; Literature Review of Existing Systems; Proposal & Product

5 Initial Idea

6 We decided to develop a new system that integrates:
- Collecting documents
- Organizing these documents
- Extracting keywords
- Ranking
- Searching

7 Literature Review of Existing Systems
Methods that these websites use to build their systems:
- Big database
- Search
- Ranking and highlighting of returned results
- Comparing documents to detect plagiarism

8 Literature Review
- Achievements of the existing systems: Attractive; Easy to use; Speed & Reliability; Quality Results; Ensuring Security Awareness
- Limitations of the existing systems: Costs; Privacy

9 Proposal
- Collect and manage Capstone projects
- Support looking up Capstone projects
- Avoid repeating and copying ideas
- Rank results
- Refer to other materials
- Friendly interface like Google
- Cheaper to build
- Free to use
- Public for everyone, inside and outside the University

10 Product (in future): Mobile application; Web application

11 Project Plan: Development environment; Process; Project organization; Team management; Project schedule; Risk management

12 Development Environment
Hardware (two configurations):
- 1 GB of RAM, 100 GB of hard disk, Core 2 Duo 2.0 GHz
- 2 GB of RAM, 100 GB of hard disk, Core 2 Duo 2.0 GHz
Software: (shown as graphics on the slide)

13 Process: follow the Iterative model

14 Project organization

15 Team management: Controlling and Monitoring
- Source code: code repository (Subversion)
- Team members: meetings, task assignment, task tracking, issue resolution, task review, reporting

16 Team management: Communication control
- Online activity: email, Google group, chat, phone
- Offline activity: project kick-off, daily and weekly meetings, working together from Monday to Saturday, team building

17 Project Schedule: Overall plan

18 Risk Management: People risk; Estimation risk; Technology risk; Requirement risk; Schedule risk

19 System Requirement Specifications: User requirements; System requirements; Non-functional requirements

20 User Requirements
- Lecturers and Students: search project documents; download documents.
- Librarians: edit profile; add/edit/delete documents; add/edit/delete categories.
- Administrator: edit profile; add/edit/delete accounts.

21 User Requirements: other requirements
- Provide both common and optional search.
- Search results are ranked.
- Each document has the following information: name, author name, supervisor name, description, and category.

22 User Requirements
System input includes:
- Keyword file
- Abstract file
- Full document file
- Other materials

23 System Requirements: External interface requirements
- User interface: easy to use.
- Communication: uses HTTP and other standard protocols for service-based interactions with client computers.
- Hardware:
  - Server: Windows Server 2008 operating system, .NET Framework 3.5, SQL Server 2008, IIS 7
  - Client: web browser

24 Non-functional Requirements: Usability; Availability; Security; Reliability; Performance; Maintainability

25 System Analysis and Design: Architectural design; Detailed design; Database design

26 Architectural design: Overall architecture; MVC architecture design pattern

27 Detailed design: “CProDM” Component Diagram

28 Database design: Entity diagram [Note: fix figure]

29 Extract Keyword Algorithm: Introduction; Study Algorithm; Evaluation; Improve Algorithm
Source: "Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information" (Yutaka Matsuo and Mitsuru Ishizuka)

30 Algorithm – What is the keyword? Position; Meaning; Frequency

31 Algorithm – Step by step
- Preprocessing: discard stop words; stem; extract frequency
- Processing: select frequent terms; expected probability; calculate χ'² value
- Output

32 Algorithm – Studying
Example
Original text: Information is the most powerful weapon in the modern society. Every day we are overflowed with a huge amount of data in form of electronic newspaper articles, emails, web pages and search results. Often, information we receive is incomplete, such that further search activities are required to enable correct interpretation and usage of this information.
Step 1, discarded stop words: Information powerful weapon modern society day overflowed huge amount data electronic newspaper articles emails web pages search results Often information receive incomplete such further search activities required enable correct interpretation usage information
Step 2, stemmed words (using the Porter Stemming Algorithm): Informat power weapon modern societi day overflow huge amoun data electronic newspaper articl email web page search result Often informat receive incomplet such further search activ requir enable correct interpret usag informat

33 Algorithm – Studying
(Same example as slide 32: original text, Step 1 discarded stop words, Step 2 stemmed words using the Porter Stemming Algorithm.)
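The stop-word step in the example above can be sketched in a few lines of Python. This is only an illustration, not the team's implementation: the `STOP_WORDS` set is a tiny hypothetical list, the function names are made up for this sketch, and stemming is left out (a Porter stemmer from an external library would be applied to each remaining token).

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "are", "every", "form", "in", "is", "most",
    "of", "that", "the", "this", "to", "we", "with",
}

def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def discard_stop_words(sentence):
    """Lowercase the sentence, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

text = ("Information is the most powerful weapon in the modern society. "
        "Every day we are overflowed with a huge amount of data.")
sentences = [discard_stop_words(s) for s in split_sentences(text)]
print(sentences)
```

Keeping the result as one token list per sentence is convenient because the later steps count co-occurrence sentence by sentence.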

34 Algorithm – Studying: Select frequent terms
The top ten frequent terms (denoted as G) and their probability of occurrence, normalized so that the sum is 1.
According to our study, the number of keywords is about 10% of the number of terms in the document, and no more than 30 terms.
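A minimal sketch of the frequent-term selection, assuming the document is already a list of token lists (one per sentence). The function names are hypothetical, and reading "about 10%" as integer division of the term count is an assumption of this sketch.

```python
from collections import Counter

def select_frequent_terms(sentences, top_n=10):
    """Return the top_n most frequent terms with probabilities that sum to 1."""
    counts = Counter(term for sentence in sentences for term in sentence)
    top = counts.most_common(top_n)
    total = sum(count for _, count in top)
    return {term: count / total for term, count in top}

def keyword_budget(n_terms, ratio=0.10, cap=30):
    """About 10% of the number of terms, and never more than 30 (assumed rounding: floor)."""
    return min(cap, int(n_terms * ratio))

sentences = [["inform", "search"], ["inform", "data"], ["search", "inform"]]
G = select_frequent_terms(sentences, top_n=2)
print(G)
```

Here the probabilities are normalized over the selected terms only, which is how the slide describes the set G.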

35 Algorithm – Studying: Co-occurrence and Importance
Two terms in a sentence are considered to co-occur once.
Example: "The imitation game could then be played with the machine in question (as B) and the mimicking digital computer (as A) and the interrogator would be unable to distinguish them."

36 Algorithm – Studying: Co-occurrence and Importance

37 Algorithm – Studying: Co-occurrence and Importance
The degree of bias of co-occurrence can be used as an indicator of term importance.
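The counting rule "two terms in a sentence co-occur once" can be sketched as follows. The function name and the dictionary layout (a symmetric mapping keyed by term pairs) are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(sentences):
    """Count, for every pair of distinct terms, the sentences in which both appear."""
    freq = defaultdict(int)
    for sentence in sentences:
        # set() ensures a pair is counted at most once per sentence
        for a, b in combinations(sorted(set(sentence)), 2):
            freq[(a, b)] += 1
            freq[(b, a)] += 1  # keep the table symmetric
    return dict(freq)

sentences = [["game", "machine", "computer"], ["machine", "computer", "machine"]]
freq = cooccurrence(sentences)
print(freq[("machine", "computer")])
```

A term whose co-occurrence counts with the frequent terms differ strongly from what chance would predict is the kind of biased term the slide treats as important.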

38 Algorithm – Studying
The statistical value of χ² is defined as
$$\chi^2(w) = \sum_{g \in G} \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g}$$
where
- p_g: unconditional probability of a frequent term g ∈ G (the expected probability)
- n_w: the total number of co-occurrences of term w and frequent terms G
- freq(w, g): frequency of co-occurrence of term w and term g

39 Algorithm – Studying
We consider the length of each sentence and revise our definitions:
- p_g: (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document)
- n_w: the total number of terms in the sentences where w appears, including w
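The revised, sentence-length-aware definitions translate almost directly into code. This is a sketch under the assumption that the document is a list of token lists; the function names are hypothetical.

```python
def expected_probability(sentences, g):
    """p_g: terms in sentences containing g, divided by all terms in the document."""
    total_terms = sum(len(s) for s in sentences)
    terms_near_g = sum(len(s) for s in sentences if g in s)
    return terms_near_g / total_terms

def terms_in_sentences_with(sentences, w):
    """n_w: total number of terms in the sentences where w appears, including w."""
    return sum(len(s) for s in sentences if w in s)

sentences = [["a", "b", "c"], ["a", "d"], ["e", "f", "g", "h", "i"]]
p_a = expected_probability(sentences, "a")
n_d = terms_in_sentences_with(sentences, "d")
print(p_a, n_d)
```

With these definitions, a term that appears in long sentences is expected to co-occur with more terms, so the expected count n_w * p_g grows accordingly.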

40 Algorithm – Studying

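41 The following function measures the robustness of bias values. It subtracts the maximal term from the χ² value:
$$\chi'^2(w) = \chi^2(w) - \max_{g \in G} \left\{ \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g} \right\}$$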
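A small sketch of both scores, assuming the formula from the cited Matsuo and Ishizuka paper: each frequent term g contributes (freq(w, g) - n_w p_g)² / (n_w p_g), χ² is the sum of these contributions, and the robust variant drops the largest one. The function names and the input layout (`freq` keyed by term pairs, `p` keyed by frequent term) are assumptions of this sketch.

```python
def chi_square_terms(w, frequent_terms, freq, n_w, p):
    """Per-frequent-term contributions (freq(w, g) - n_w * p_g)^2 / (n_w * p_g)."""
    contributions = []
    for g in frequent_terms:
        expected = n_w * p[g]          # expected co-occurrence count with g
        observed = freq.get((w, g), 0)  # observed co-occurrence count with g
        contributions.append((observed - expected) ** 2 / expected)
    return contributions

def chi_square(w, frequent_terms, freq, n_w, p):
    """Plain chi-square value: the sum of all contributions."""
    return sum(chi_square_terms(w, frequent_terms, freq, n_w, p))

def chi_square_robust(w, frequent_terms, freq, n_w, p):
    """Robust value: chi-square minus its single largest contribution."""
    contributions = chi_square_terms(w, frequent_terms, freq, n_w, p)
    return sum(contributions) - max(contributions)

# Toy numbers chosen for easy checking; they are not from the presentation.
p = {"a": 0.6, "b": 0.4}
freq = {("w", "a"): 9, ("w", "b"): 1}
print(chi_square("w", ["a", "b"], freq, 10, p))
print(chi_square_robust("w", ["a", "b"], freq, 10, p))
```

Dropping the largest contribution keeps a term from scoring highly only because it co-occurs with one particular frequent term.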

42 Algorithm – Studying

43 Algorithm – Evaluation
- Precision: ratio of correct keywords to the number of keywords in the list
- Coverage: ratio of indispensable keywords in the list to all the indispensable terms
- Frequency index: average frequency of the keywords in the list
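The three evaluation measures can be written as short functions. This is a sketch with hypothetical names and made-up sample data; which keywords count as "correct" or "indispensable" is decided by human judges and is simply passed in here.

```python
def precision(extracted, correct):
    """Share of extracted keywords that are judged correct."""
    return sum(1 for k in extracted if k in correct) / len(extracted)

def coverage(extracted, indispensable):
    """Share of indispensable terms that appear in the extracted list."""
    return sum(1 for k in indispensable if k in extracted) / len(indispensable)

def frequency_index(extracted, term_counts):
    """Average document frequency of the extracted keywords."""
    return sum(term_counts.get(k, 0) for k in extracted) / len(extracted)

# Illustrative data only.
extracted = ["search", "rank", "page", "data"]
correct = {"search", "rank", "data"}
indispensable = {"search", "keyword"}
term_counts = {"search": 8, "rank": 4, "page": 2, "data": 6}

print(precision(extracted, correct))
print(coverage(extracted, indispensable))
print(frequency_index(extracted, term_counts))
```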

44 Ranking – Why? Ranking results

45 Ranking

46 Ranking
Use the rank calculation formula for a term in a collection of documents (from "Automatic Keyword Extraction for Database Search"; First examiner: Prof. Dr. techn. Dipl.-Ing. Wolfgang Nejdl; Second examiner: Prof. Dr. Heribert Vollmer; Supervisor: MSc. Dipl.-Inf. Elena Demidova):
R(t) = Fd(t) * log(1 + N/N(t))   (1)
where
- R(t): rank of term t in all the collection
- N: total number of documents in the collection
- Fd(t): frequency of term t in the given document
- N(t): total number of documents that contain term t
Ranking formula:
Rank = d * Rd(t) / R(t)   (2)
=> Rank = d * Rd(t) / (Fd(t) * log(1 + N/N(t)))   (3)
where
- d: reliability coefficient
- Rd(t): rank of term t in the document, as extracted by the Extract Service
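Formulas (1) to (3) can be checked with a short sketch. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; the function and parameter names are hypothetical.

```python
import math

def collection_rank(fd_t, n_docs, n_docs_with_t):
    """Formula (1): R(t) = Fd(t) * log(1 + N / N(t)), natural log assumed."""
    return fd_t * math.log(1 + n_docs / n_docs_with_t)

def rank(d, rd_t, fd_t, n_docs, n_docs_with_t):
    """Formulas (2)/(3): Rank = d * Rd(t) / R(t)."""
    return d * rd_t / collection_rank(fd_t, n_docs, n_docs_with_t)

# Toy values: a term that occurs twice in the document and in every document
# of a 10-document collection, so R(t) = 2 * log(2).
r_t = collection_rank(2, 10, 10)
print(r_t)
print(rank(d=1.0, rd_t=r_t, fd_t=2, n_docs=10, n_docs_with_t=10))
```

Both N(t) and Fd(t) must be positive for the division to be defined, which holds for any term that actually occurs in the given document.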

47 Ranking

48 Testing

49 Testing: Test result
No  Tester  Module code           Pass  Fail  Untested  N/A  Number of test cases
1   AnhNT   Master Page           18    0     0         0    18
2   AnhNT   Home Page             12    0     0         0    12
3   AnhNT   Search Result         5     0     0         0    5
4   AnhNT   User Account          69    0     0         0    69
5   AnhNT   Error Page            8     0     0         0    8
6   NamH    Category              36    0     0         0    36
7   NamH    Document              47    0     0         0    47
8   NamH    Authenticated         81    0     0         0    81
9   NamH    User Document Detail  9     0     0         0    9
    Sub total                     285   0     0         0    285
Test coverage: 100.00%
Test successful coverage: 100.00%

50 Deployment

51 User guide

52 Improvement
To improve the quality of extracted keywords, we will cluster terms. Two major approaches (Hofmann & Puzicha 1998) are:
- Similarity-based clustering: if terms w1 and w2 have a similar distribution of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster.
- Pairwise clustering: if terms w1 and w2 co-occur frequently, w1 and w2 are considered to be in the same cluster.
[Note: add example]

53 Improvement
Similarity-based clustering centers upon the red circles; pairwise clustering focuses on the green circles.

54 Improvement: Similarity-based clustering
Cluster a pair of terms whose Jensen-Shannon divergence is (threshold and formula shown as graphics on the slide).
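Since the slide's own formula is not available in the transcript, the following is a sketch of the standard Jensen-Shannon divergence between two co-occurrence distributions, not necessarily the exact form used in the project. Each distribution is assumed to be a dictionary mapping a term w' to P(w' | w); the function name is hypothetical and no clustering threshold is assumed.

```python
import math

def js_divergence(p, q):
    """Standard Jensen-Shannon divergence (natural log) between two distributions."""
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = (pk + qk) / 2  # mixture distribution
        if pk > 0:
            total += 0.5 * pk * math.log(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log(qk / mk)
    return total

same = js_divergence({"x": 0.5, "y": 0.5}, {"x": 0.5, "y": 0.5})
disjoint = js_divergence({"x": 1.0}, {"y": 1.0})
print(same, disjoint)
```

The value is 0 for identical distributions and reaches log 2 when the two terms never share a co-occurring term, so it gives a bounded measure of how differently two terms are distributed.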

55 Improvement: Pairwise clustering
Cluster a pair of terms whose mutual information is (threshold and formula shown as graphics on the slide).
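As with the previous slide, the formula itself is missing from the transcript. The sketch below uses the standard pointwise mutual information, log P(w1, w2) / (P(w1) P(w2)), with probabilities estimated from counts; the function name, the parameter names, and the choice of natural logarithm are assumptions, and no threshold is assumed.

```python
import math

def mutual_information(freq_w1w2, freq_w1, freq_w2, n_total):
    """Pointwise mutual information: log(N * freq(w1, w2) / (freq(w1) * freq(w2)))."""
    if freq_w1w2 == 0:
        return float("-inf")  # the terms never co-occur
    return math.log(n_total * freq_w1w2 / (freq_w1 * freq_w2))

independent = mutual_information(1, 10, 10, 100)
associated = mutual_information(4, 10, 10, 100)
print(independent, associated)
```

A value of 0 means the pair co-occurs exactly as often as independence would predict; larger values indicate terms that appear together more often than chance.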

56 Summary
- Strong points: creative; active; cope with change
- Weak points: lack of technical skills; lack of management skills
- Lessons learned: improve technical & management skills; release the product on time within the restrictions of time and resources; improve communication and problem-solving skills

57 Demo & Q&A

58

