Contextual Query Using Bell Tests
Joao Barros, Zeno Toffano, Youssef Meguebli and Bich-Liên Doan
SUPELEC (École Supérieure d'Électricité), France
Quantum Interaction 2013, Leicester, July 25-27, 2013
Quantum Interaction Research at SUPELEC
Research activity initiated in 2011 at the instigation of:
- Bich-Liên DOAN (Computer Science: Information Retrieval, Semantic Web), Dept. of Computer Science.
- Zeno TOFFANO (physicist, lectures on Quantum Mechanics; research: Solid State Physics, NMR, Lasers, Fiber Optics Telecommunications), Dept. of Telecommunications.
2 PhD students:
- Joao BARROS (MSc: Theoretical Physics), at the heart of this research (funding from « Fondation SUPELEC »).
- Youssef MEGUEBLI (MSc: Computer Science): opinion-based information retrieval.
We undertook some preliminary investigations in the form of tests (arXiv:1207.4328).
We emphasize the experimental « Quantum-like » approach.
Preliminary Investigations: Poll tests on polysemy in a foreign language
The test assesses correlations in a foreign language (Chinese here). It aims to show the role of the polysemy of words. Participants were asked to quantify how strongly each proposed word correlates with each of its meanings. 4 people were interviewed (all Chinese) and gave their opinion scores.

POLL TEST (4 persons)
Chinese word   Meaning                  Scores
笔记本          laptop computer          9 9 6 4
               paper notebook           4 6 6 7
性              sex                      9 9 5 6
               character                2 6 6 5
生              life                     5 5 8 5
               to be born               9 9 6 7
清              Qing dynasty             8 6 4 8
               fair and honest          2 5 6 3
出入            go in and out            8 6 6 6
               the failure to agree     4 6 9 6
Preliminary Investigations: Poll tests on heterogeneous media
The test proposes nine musical excerpts. The question is to rate, from 0 to 10, whether each excerpt falls under the category "rock" or "blues". 4 persons were interviewed.
The sum of the results for both categories is very rarely equal to 10, indicating that the chosen categories are certainly not mutually exclusive.
Interference effects between concepts of different media (over-extension/under-extension).

POLL TEST: "Music excerpts belonging to"
Excerpt                                "Rock"   "Blues"   "Rock" + "Blues"   Interference
« Dazed and confused »                 5.5      4.5       10                 =
« Susie Q »                            8        2.5       10.5               over
« That's all right »                   4.5      5.75      10.25              over
« Folsom prison blues »                4        4         8                  under
« The wind cries Mary »                4.75     6.25      11                 over
« Don't let me down »                  8.25     3.25      11                 over
« Tenth avenue freeze out »            3.5      3.75      7.25               under
« Since I've been loving you »         2        8         10                 =
« I heard it through the grapevine »   7.5      3.25      10.75              over
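The over/under labels in the table come from comparing the sum of the two mean ratings with the scale maximum of 10. A minimal Python sketch of that bookkeeping, using three rows copied from the table; the helper name `interference` is hypothetical and not part of the original study:

```python
def interference(rock, blues, scale=10):
    """Label a pair of ratings by comparing their sum with the scale maximum."""
    total = rock + blues
    if total > scale:
        return total, "over"   # over-extension: sum exceeds the scale
    if total < scale:
        return total, "under"  # under-extension: sum falls short of the scale
    return total, "="          # ratings behave like exclusive categories


# Mean ratings ("Rock", "Blues") copied from three rows of the poll table.
ratings = {
    "Dazed and confused": (5.5, 4.5),
    "Susie Q": (8, 2.5),
    "Folsom prison blues": (4, 4),
}

labels = {title: interference(r, b) for title, (r, b) in ratings.items()}
for title, (total, label) in labels.items():
    print(f"{title}: {total} ({label})")
```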
Preliminary Investigations: Words belonging to 2 categories: correlation
Tests on words belonging to two categories, Fruit or Vegetable, or to both (the questions are independent).
We define a « Bell-like » correlation parameter S_Bell.
In this analysis we observed no violation of the Bell inequality (S_Bell < 2).

POLL TEST: "Word belonging to" (maximum of µ: 1; maximum of S_Bell: 4)
Word          µ(Fruit)   µ(Veg.)   µ(F or V)   S_Bell
garlic        0.16       0.52      0.33        0.39
almond        0.68       0.07      0.83        0.82
beet          0.2        0.4       0.53        0.35
broccoli      0.04       0.92      1           0.93
mushroom      0.06       0.75      0.6         0.36
cauliflower   0.06       0.92      0.93        0.86
cucumber      0.18       0.72      0.83        0.61
gherkin       0.26       0.47      0.8         0.41
spinach       0.04       0.95      0.8         0.75
bean          0.1        0.97      0.87        0.84
coconut       0.96       0.02      0.6         1.36
olive         0.6        0.4       0.67        0.77
parsley       0.14       0.35      0.47        0.37
pepper        0.08       0.05      0           1.03
potato        0.16       0.45      0.17        0.62
apple         0.94       0.03      1           0.94
ginger        0.1        0.22      0.26        0.63
grapes        1          0.02      1           1
tomato        0.8        0.6       1           0.92
Preliminary Investigations: First approach: « Heuristic Quantum-like HAL model »

Document n°             1       2       3       Score
Tomato AND Fruit        0.788   0.581   0.373   p_TF
Tomato AND Vegetable    0.349   0.469   0.213   p_TV
Plant AND Fruit         0.651   0       0.385   p_PF
Plant AND Vegetable     0.315   0       0.223   p_PV
Bell parameter S        0.947   2.23    1.105
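The transcript does not state how S is obtained from the four scores. As an assumption, converting each score p to an expectation value E = 2p - 1 and forming the CHSH-type combination S = |E_TF - E_TV| + |E_PF + E_PV| reproduces the tabulated values to within rounding. The sketch below checks this; the formula and the helper name `bell_parameter` are inferred from the numbers and are not taken from the authors.

```python
def bell_parameter(p_tf, p_tv, p_pf, p_pv):
    """CHSH-type combination of four scores.

    Assumption (inferred from the table, not stated in the slides):
    each score p is mapped to an expectation value E = 2p - 1.
    """
    e_tf, e_tv, e_pf, e_pv = (2 * p - 1 for p in (p_tf, p_tv, p_pf, p_pv))
    return abs(e_tf - e_tv) + abs(e_pf + e_pv)


# Scores copied from the table: (p_TF, p_TV, p_PF, p_PV) for each document.
documents = {
    1: (0.788, 0.349, 0.651, 0.315),
    2: (0.581, 0.469, 0.0, 0.0),
    3: (0.373, 0.213, 0.385, 0.223),
}

s_values = {n: bell_parameter(*scores) for n, scores in documents.items()}
for n, s in s_values.items():
    print(f"document {n}: S = {s:.3f}")
```

Under this assumed formula, document 2 is the only one whose value exceeds the classical limit of 2.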
Bell Inequalities
Long story: the field of Bell inequality violations (Bell 1964) and entanglement has fascinated many scientists throughout the last decades.
An interesting historical narrative is "How the Hippies Saved Physics" by David Kaiser, W. W. Norton (Physics World 2012 Book of the Year).
Much debate:
- classical and non-classical behaviour
- entanglement
- local and non-local, contextual and non-contextual
- "more than quantum" correlations, non-local boxes…
Experiments demonstrating Bell inequality violation:
- 1969 Clauser et al.: proposal of an experimental test (CHSH); 1972 Freedman and Clauser: first experiment
- 1982 A. Aspect (Orsay, France) on polarized photons: definitive proof
- Entanglement with spins (NMR, Rydberg atoms…)
- Towards the realization of a Quantum Computer
A new field: Quantum Information. Entanglement is at the heart of this field because it is seen as a potential "resource" for computing (lower complexity) and coding (secure cryptography).
HAL and QI Research
We investigate the relationships between words within a document; these relationships can be captured by creating a "semantic space" using the Hyperspace Analogue to Language (HAL) model introduced by Lund and Burgess (1996).
The HAL algorithm does not require any explicit human a priori judgment.
In the procedure, a HAL lexical co-occurrence matrix is built with a "window" representing a span of words that is passed over the corpus being analyzed.
Operationally, two words are considered as co-occurring when they appear in the same floating window. The size of this window is a few words to the left and right of the word in question.
Similar approach: LSA (Latent Semantic Analysis) also builds matrices in semantic space.
Darányi and Wittek: physical analogy between the HAL semantic space and Quantum Theory, in which each word can be associated with a given energy (in analogy with spectral emission lines in atoms, which correspond to transition energies).
Bruza: HAL used in analogies with Quantum Theory for activating associations between concepts.
The HAL Matrix Semantic Space
The matrix is built with a "window" representing a span of words that is passed over the corpus being analyzed. The width of this window can be varied.
Words within the window are recorded as co-occurring, with a weight that decreases with the number of other words separating them within the window (word distance measure).
The information contained in a row is the sum of co-occurrences for words appearing before the word; the information contained in a column is the sum of co-occurrences for words appearing after the word.
We used a symmetric, real, positive matrix obtained as the sum of the HAL matrix and its transpose (equivalent to also running HAL backwards).
All words are considered, and simple plurals are treated as singular words. Lower and upper case letters are not distinguished.
Words sharing the same root are treated as distinct (for example, "battle" and "battling" are different words).
Document « Orange »: construction of the HAL matrix
Symmetric matrix: sum of two HAL matrices (forward and backward).
Repeated words strengthen the associated vector (see "orange" and "the" in the example below).
The rows and columns of the symmetric co-occurrence matrix constitute vectors in a high-dimensional space. The dimensionality of the space is determined by the number of columns in the matrix (context vectors).

Text example with a window spanning 3 words (l = 3):
"THE COLOUR ORANGE TAKES ITS NAME FROM THE ORANGE FRUIT"

Matrix M + M^T (l = 3)
          THE   COLOUR   ORANGE   TAKES   ITS   NAME   FROM   FRUIT
THE        16        3        5       1     1      2      3       2
COLOUR      3        8        3       2     1      0      0       0
ORANGE      5        3       16       3     2      2      2       3
TAKES       1        2        3       8     3      2      1       0
ITS         1        1        2       3     8      3      2       0
NAME        2        0        2       2     3      8      3       0
FROM        3        0        2       1     2      3      8       1
FRUIT       2        0        3       0     0      0      1       8
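The matrix above can be reproduced with a short Python sketch. The weighting used here is inferred from the table rather than stated in the slides: a word at distance d inside the window contributes l + 1 - d, with d = 0 (the word itself) included, which is what yields the diagonal values 8 and 16. The function name `hal_matrix` is illustrative only.

```python
def hal_matrix(words, l):
    """Build the symmetric HAL matrix M + M^T for a list of words.

    Assumption (inferred from the example table): a word at distance d
    before the current word, 0 <= d <= l, contributes a weight l + 1 - d.
    """
    vocab = list(dict.fromkeys(words))           # unique words, first-seen order
    index = {w: k for k, w in enumerate(vocab)}
    n = len(vocab)
    m = [[0] * n for _ in range(n)]
    for i, word in enumerate(words):
        for d in range(l + 1):                   # d = 0 is the word itself
            j = i - d
            if j < 0:
                break
            m[index[word]][index[words[j]]] += l + 1 - d
    # Adding the transpose accounts for the words that follow each word.
    sym = [[m[r][c] + m[c][r] for c in range(n)] for r in range(n)]
    return vocab, sym


text = "THE COLOUR ORANGE TAKES ITS NAME FROM THE ORANGE FRUIT"
vocab, sym = hal_matrix(text.split(), l=3)
for word, row in zip(vocab, sym):
    print(f"{word:<7}", row)
```

The repeated words THE and ORANGE end up with a diagonal entry of 16 instead of 8, which is the "strengthening" effect mentioned on the slide.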
Quantum model for HAL: Query operator basis representation
Bell parameter calculation using Query Operators
Using specific operators associated with words A and B.
This particular choice of operators is inspired by the usual example that maximizes the violation of the Bell inequalities.
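For reference, the textbook example alluded to here can be checked numerically. The Python sketch below is the standard two-qubit CHSH setting (Pauli operators measured on a maximally entangled state). It is not the word-query operators of this slide, which are not reproduced in the transcript, and all names in it are illustrative.

```python
import math


def kron(a, b):
    """Kronecker product of two 2x2 matrices, returned as a 4x4 nested list."""
    return [[a[i][j] * b[k][m] for j in range(2) for m in range(2)]
            for i in range(2) for k in range(2)]


def expectation(op, psi):
    """Expectation value <psi|op|psi> for a real 4-component state."""
    return sum(psi[r] * op[r][c] * psi[c] for r in range(4) for c in range(4))


r = 1 / math.sqrt(2)
Z = [[1, 0], [0, -1]]   # Pauli sigma_z
X = [[0, 1], [1, 0]]    # Pauli sigma_x

# Standard measurement settings that maximize the CHSH combination.
A1, A2 = Z, X
B1 = [[r * (Z[i][j] + X[i][j]) for j in range(2)] for i in range(2)]
B2 = [[r * (Z[i][j] - X[i][j]) for j in range(2)] for i in range(2)]

psi = [r, 0, 0, r]      # Bell state (|00> + |11>) / sqrt(2)


def corr(a, b):
    """Correlation of the joint measurement a (first qubit), b (second qubit)."""
    return expectation(kron(a, b), psi)


S = corr(A1, B1) + corr(A1, B2) + corr(A2, B1) - corr(A2, B2)
print(S, 2 * math.sqrt(2))
```

With these settings S reaches 2√2, above the classical limit of 2.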
Bell parameter calculation: Q-HAL algorithm
Flow diagram of the Quantum HAL algorithm:
1. Input: document and window size l.
2. Construction of a "clean" (no punctuation marks) sequence of words, including any repeated words: the Doc list.
3. Construction of the "Dictionary", the sequence of non-repeated words: the Dic list.
4. Construction of a primitive HAL matrix: for each word of the Doc list, a window of length l is associated and all the scores of the words within it are collected in a matrix. The entry for each score is determined by the position of the words in the Dic list.
5. The complete HAL matrix is obtained by summing this matrix with its transpose.
6. Normalization of each row vector.
7. Determination of the state of the system by summing over all vectors and normalizing.
8. Calculation of the expected values of the defined operators and of the Bell parameter.
9. Plot, then repeat from step 4 with a new window size l + 1.
The algorithm was implemented in the Python programming language, using the string module and pylab.
The approach presented here can be perceived as an experiment done on objects outside the domain of physics.
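The text-handling and normalization steps of this flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, only ASCII punctuation is stripped, plurals are not merged, and the small 3x3 matrix stands in for a real HAL matrix.

```python
import math
import string


def doc_list(text):
    """Clean word sequence: ASCII punctuation removed, lower-cased, repeats kept."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower().split()


def dic_list(doc):
    """Dictionary: the words of the Doc list without repetitions, first-seen order."""
    return list(dict.fromkeys(doc))


def normalize(vec):
    """Scale a vector to unit Euclidean norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def state_vector(hal):
    """Normalize each row, sum the rows component-wise, normalize the sum."""
    rows = [normalize(row) for row in hal]
    total = [sum(column) for column in zip(*rows)]
    return normalize(total)


doc = doc_list("The colour orange takes its name from the orange fruit.")
dic = dic_list(doc)

# Hypothetical symmetric matrix used only to exercise state_vector.
toy_hal = [[8, 3, 0], [3, 8, 3], [0, 3, 8]]
psi = state_vector(toy_hal)
print(len(doc), len(dic), psi)
```

The expectation values of step 8 depend on the query operators, which are not given in this transcript, so they are left out of the sketch.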
Query 1: Word "Reagan" in the context of the word "Iran"
Query 2: Test on the polysemy of the word "orange"
Query: Pathological text example
We made 2-word queries on « pathological » documents, consisting of texts with a repeating periodic structure based on the same original document.
The curves still peak at Tsirelson's bound and also show other effects, probably due to the repetition period.
Queries on words A_1 and A_100 in a text of 5000 words, for text repetition periodicities of 100, 150 and 200.
Comments on Results
The results show a Bell parameter that peaks at the maximal value S_Bell = 2√2 (Tsirelson's bound).
We found that the Bell parameter depends strongly on the HAL window size. There is an optimal window size that maximizes S_Bell.
This is reminiscent of what was already noticed by Bruza. A possible explanation:
- if the window size is set too large, spurious co-occurrence associations are represented in the matrix;
- if the window size is too small, relevant associations may be missed.
Comparing different documents, the one whose peak appears first seems to be the most relevant.