Joint Visual-Text Modeling for Multimedia Retrieval
Transcription
Summary of the JHU Workshop 2004, prepared for the USTV Master's course.

Big Picture: The Multimedia Retrieval Task
Example: "Find clips showing Yasser Arafat."
- Process the query text and the query image(s).
- Represent both the query and the documents (video clips) as words and visterms (discrete visual tokens, defined below).
- Joint word-visterm retrieval: retrieve documents using p(Document | Query).
- Output: a ranked list of video shots.

Our Search Task Definition (NIST TRECVID)
- A query is a statement of information need plus examples.
- Query of words: "... Palestinian leader Yasser Arafat today said ..."
- The spoken transcript of a relevant clip may be badly degraded by ASR errors, e.g. "... Yes sir, you're fat today said ..."
- Most research has addressed:
  I. Text queries, text (or degraded text) documents: spoken document retrieval.
  II. Image queries, image data: content-based image retrieval.
- Our approach: combine the scores with joint visual-text models.

Language-Model-Based Retrieval
- A document d consists of words dw and visterms dv; a query q consists of words qw and visterms qv.
- Rank documents with p(qw, qv | dw, dv); equivalently, retrieve documents using p(dw, dv | qw, qv).

Evaluation
- Selection of system parameters may be manual, interactive, or automatic.
- Goal: isolate algorithmic issues from interface and user issues.

Multimedia Retrieval System
- Baseline model.
- Relating document visterms to query words (MT, relevance models, HMMs).
- Relating document words to query images (text classification experiments).
- Visual-only retrieval models.

Presentation Outline
- Translation (MT) models (Paola)
- Relevance models (Shao Lei, Desislava)
- Graphical models (Pavel, Brock)
- Text classification models (Matt)
- Integration and summary (Dietrich)

Experimental Setup: Visual Features
- Region selection: a regular partition of the image, or interest-point neighborhoods (Harris detector) on the greyscale image.
- Feature list: L*a*b moments (COLOR), smoothed edge orientation histogram (EDGE), grey-level co-occurrence matrix (TEXTURE).

Inspiration from Machine Translation
- Machine translation operates on discrete tokens; in our task, one "language" consists of concepts (sun, sky, waves, sea; grass, tiger, ...).
- However, the features extracted from image regions are continuous.
- Solution: vector quantization -> visterms. Mapping each region's feature vector to a codebook entry, {f1, f2, ..., fm} -> vk, gives the discrete representation that creates the analogy to MT.
- MT alignment model: p(f | e) = sum_a p(f, a | e); its image-annotation analogue: p(c | v) = sum_a p(c, a | v).
- Direct translation model: document visterms are translated into concept words, e.g. v10 v22 v35 v43 -> sun, sky, sea, waves; v20 v21 v50 v10 -> tiger, water, grass; v78 v1 -> water, harbor, sky, clouds, sea.

Image Annotation Using Translation Probabilities
- p(c | v): probabilities obtained from the direct translation model.
- Annotate a document dV by averaging over its visterms:
    P0(c | dV) = (1 / |dV|) * sum_{v in dV} P(c | v)
- Annotation results (Corel set), manual vs. predicted: e.g. manual "field, foals, horses, mare, tree" predicted as "horses, foals, mare, field"; manual "people, pool, swimmers, water" predicted as "swimmers, pool, people, water, sky".

Images are defined by spatial context.
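The two steps above, quantizing continuous region features into discrete visterms and annotating a document with the direct-translation average P0(c | dV), can be sketched in plain Python. This is a minimal illustration, not the workshop code: the function names are hypothetical, k-means is used as one standard vector-quantization choice, and the translation probabilities p(c | v) are taken as given (they would come from the trained alignment model).

```python
import random

def build_visterm_codebook(region_features, k, iters=20, seed=0):
    """Vector-quantize continuous region features into k discrete visterms
    using plain k-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(region_features, k)]
    for _ in range(iters):
        # Assign every region feature to its nearest centroid ...
        clusters = [[] for _ in range(k)]
        for f in region_features:
            clusters[nearest(f, centroids)].append(f)
        # ... then move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centroids[j] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids

def nearest(feature, centroids):
    """Index of the closest centroid, i.e. the visterm id v_k for a region."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2
                                 for a, b in zip(feature, centroids[j])))

def annotate(doc_visterms, p_c_given_v, top_n=5):
    """Direct-translation annotation of a document d_V:
        P0(c | d_V) = (1 / |d_V|) * sum_{v in d_V} P(c | v)
    p_c_given_v maps a visterm id to {concept word: probability}."""
    scores = {}
    for v in doc_visterms:
        for c, p in p_c_given_v[v].items():
            scores[c] = scores.get(c, 0.0) + p / len(doc_visterms)
    return sorted(scores.items(), key=lambda cp: -cp[1])[:top_n]
```

In the workshop setting the codebook would be built from the COLOR, EDGE, and TEXTURE features of grid regions or Harris interest-point neighborhoods.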
Cross-Media Relevance Models (CMRM): Intuition
- Isolated pixels have no meaning; context simplifies recognition and retrieval. E.g. a tiger is associated with grass, trees, water, and forest, and is less likely to be associated with computers.
- Two parallel vocabularies: words and visterms. Analogous to cross-lingual relevance models.
- Estimate the joint probabilities of words and visterms from training images. For a concept c and document visterms dv, with T the training set of annotated images J:

    P(c, dv) = sum_{J in T} P(J) * P(c | J) * prod_{i=1..|dv|} P(vi | J)

- J. Jeon, V. Lavrenko and R. Manmatha, "Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models," Proc. SIGIR'03.

Annotation Examples (Corel set)
Top: manual annotations; bottom: predicted words (the top 5 words with the highest probability); red marks correct matches. Examples include manual "jet, plane, sky" predicted as "sky, plane, jet, tree, clouds" and manual "mountain, sky, snow, water" predicted as "sky, mountain, water, clouds, snow".

Proposal: Using Dynamic Information for Video Retrieval

Motivation
- Current models are based on single frames in each shot, but video is dynamic.
- Motion is an important cue for retrieving actions/events, and many TRECVID 2003 queries require motion information, e.g. "find shots of an airplane taking off" and "find shots of a person diving into water."
- Using the optical flow over the entire image does not help; use motion features from objects instead.

Why Use Dynamic Information?
- Model actions/events.
- Better image representations (segmentations): it is much easier to segment moving objects from the background than to segment static images.
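The CMRM joint probability introduced above can be sketched as follows. This is a simplified illustration, not the workshop code: the training-set representation (dicts with "words" and "visterms" lists), the uniform prior P(J), and the Jelinek-Mercer-style smoothing weight are assumptions made for the sketch; the SIGIR'03 paper tunes its own smoothing.

```python
def cmrm_score(concept, doc_visterms, training_set, smoothing=0.1):
    """Cross-Media Relevance Model joint probability
        P(c, d_v) = sum_{J in T} P(J) * P(c | J) * prod_i P(v_i | J)
    estimated from annotated training images. Each training image J is a
    dict with 'words' and 'visterms' lists. Per-image maximum-likelihood
    estimates are smoothed against collection-level frequencies."""
    # Collection-level background distributions for smoothing.
    all_words = [w for J in training_set for w in J["words"]]
    all_vis = [v for J in training_set for v in J["visterms"]]
    bg_w = all_words.count(concept) / len(all_words)

    total = 0.0
    p_J = 1.0 / len(training_set)   # uniform prior over training images
    for J in training_set:
        # Smoothed P(c | J).
        p_c = ((1 - smoothing) * J["words"].count(concept) / len(J["words"])
               + smoothing * bg_w)
        # Smoothed product over the document's visterms, prod_i P(v_i | J).
        p_vs = 1.0
        for v in doc_visterms:
            bg_v = all_vis.count(v) / len(all_vis)
            p_vs *= ((1 - smoothing) * J["visterms"].count(v) / len(J["visterms"])
                     + smoothing * bg_v)
        total += p_J * p_c * p_vs
    return total
```

Ranking concepts by this score for a fixed dv annotates the image; ranking documents by the score of query words retrieves them.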
Segmentation Comparison
- (a) Segmentation using only still-image information vs. (b) segmentation using only motion information. (Pictures from Patrick Bouthemy's website, INRIA.)

Hidden Markov Models for Image Annotation

Inadequacy of the Annotations
- Corel database: annotators often mark only the interesting objects (e.g. "beach, palm, people, tree").
- TRECVID database: annotation concepts capture mostly the semantics of the image and are not very suitable for describing visual properties (e.g. "man-made object, car, transportation, vehicle, outdoors, non-studio setting, nature-non-vegetation, snow").

Alignment Problems
- There is no notion of order in the annotation words.
- Hence difficulties with automatic alignment between words and image regions.

Gradual Training: Results
- Improved alignment of training images.
- Annotation performance on test images did not change significantly.

Joint Segmentation and Labeling
- Example image annotated "tiger, grass, sky": the sky, grass, and tiger regions are segmented and labeled jointly.

Predicting Visual Concepts From Text: Building Features
Pipeline: insert sentence boundaries -> case restoration -> noun extraction -> named entity detection -> WordNet processing -> feature set.

ASR -> Features Example (raw ASR output)
STEVE FOSSETT AND HIS BALLOON SOLO SPIRIT ARSENIDE OVER THE BLACK SEA DRIFTING SLOWLY TOWARDS THE COAST OF THE CAUCUSES HIS TEAM PLANS IF NECESSARY TO BRING HIM DOWN AFTER DAYLIGHT TOMORROW YOU THE CHECHEN CAPITAL OF GROZNY

ASR -> Features Example (sentence boundaries inserted)
STEVE FOSSETT AND HIS BALLOON SOLO SPIRIT ARSENIDE. OVER THE BLACK SEA DRIFTING SLOWLY TOWARDS THE COAST OF THE CAUCUSES. HIS TEAM PLANS IF NECESSARY TO BRING HIM DOWN AFTER DAYLIGHT TOMORROW.
YOU THE CHECHEN CAPITAL OF GROZNY.

ASR -> Features Example (case restoration)
Steve Fossett and his balloon Solo Spirit arsenide. Over the Black Sea drifting slowly towards the coast of the caucuses. His team plans if necessary to bring him down after daylight tomorrow. you the Chechan capital of Grozny.

ASR -> Features Example (extracted features)
- Named entities: Male Person, Location (Region)
- Nouns: balloon, solo, spirit, coast, caucus, team, daylight, Chechan, capital, Grozny
- WordNet: nature

Will This Help for Retrieval?
Example queries with the visual concepts predicted from their text:
- "Find shots of a person diving into some water." -> person, water_body, non-studio_setting, nature_non-vegetation, person_action, indoors
- "Find shots of the front of the White House in the daytime with the fountain running." -> building, outdoors, sky, water_body, cityscape, house, nature_vegetation
- "Find shots of Congressman Mark Souder." -> person, face, indoors, briefing_room_setting, text_overlay

Joint Visual-Text Video OCR

Motivation
Queries where the relevant information appears as text overlaid on the video:
- "Find shots of Congressman Mark Souder."
- "Find shots of a graphic of the Dow Jones Industrial Average showing a rise for one day. The number of points risen that day must be visible."
- "Find shots of the Tomb of the Unknown Soldier in Arlington National Cemetery."
Raw video OCR output can be badly garbled, e.g. "WEIFll I1 NFWdJ TNNIF H".

Why Use Video OCR?
Text overlays provide high-precision information about query-relevant concepts in the current image.

Image Processing
- Preprocessing: normalize the text region's height.
- Feature extraction: color, edge strength and orientation.

Proposal: HMM-Based Recognizer
An HMM whose states c1, c2, ..., c6 are traced over the characters of an overlaid word (M, A, I, T, K on the slide).

CONCLUSION: Summary of Multimodal Queries and Methods
A document has words dw and visterms dv; a query has words qw and visterms qv. The four retrieval directions and the methods studied for each:
- p(qw | dw): language model (LM)
- p(qw | dv): MT, relevance models, HMM
- p(qv | dw): Naïve Bayes, Max. Ent., SVM, AdaBoost, ...
- p(qv | dv): visterm-to-visterm (visual-only) retrieval
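The p(qv | dw) direction in the concluding table, predicting visual concepts from ASR text, can be sketched with a small Naïve Bayes classifier. The feature extraction below is a deliberately crude stand-in for the real pipeline (sentence boundaries, case restoration, noun extraction, named entities, WordNet); all names and the add-one smoothing are illustrative assumptions, not the workshop's implementation.

```python
import math
from collections import Counter, defaultdict

def extract_features(asr_text):
    """Crude stand-in for the feature pipeline: lowercase and split."""
    return asr_text.lower().split()

def train_nb(labeled_docs):
    """labeled_docs: list of (asr_text, set_of_visual_concepts)."""
    word_counts = defaultdict(Counter)   # concept -> word counts
    concept_counts = Counter()           # concept -> number of docs
    vocab = set()
    for text, concepts in labeled_docs:
        words = extract_features(text)
        vocab.update(words)
        for c in concepts:
            concept_counts[c] += 1
            word_counts[c].update(words)
    return word_counts, concept_counts, vocab, len(labeled_docs)

def predict(model, asr_text):
    """Rank concepts by log P(c) + sum_w log P(w | c), add-one smoothed."""
    word_counts, concept_counts, vocab, n_docs = model
    words = extract_features(asr_text)
    scores = {}
    for c, n_c in concept_counts.items():
        total = sum(word_counts[c].values())
        s = math.log(n_c / n_docs)
        for w in words:
            s += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = s
    return sorted(scores, key=scores.get, reverse=True)
```

Maximum entropy, SVMs, or AdaBoost (the other methods in the table) would consume the same feature sets.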
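Finally, the four retrieval directions in the concluding table have to be combined into one document score. A common way to fuse such probabilities is a weighted log-linear combination; the scorer interfaces and weights below are assumptions made for this sketch, with the weights playing the role of the system parameters (manually or automatically selected) discussed under Evaluation.

```python
import math

def combined_score(scorers, weights, query, doc):
    """Log-linear fusion of the retrieval directions:
        p(qw|dw)  language model
        p(qw|dv)  MT / relevance models / HMM
        p(qv|dw)  text classification (Naive Bayes, MaxEnt, SVM, ...)
        p(qv|dv)  visual-only retrieval
    `scorers` maps a direction name to a function (query, doc) -> probability;
    `weights` maps the same names to fusion weights."""
    return sum(weights[name] * math.log(max(scorers[name](query, doc), 1e-12))
               for name in scorers)

def rank_documents(scorers, weights, query, docs):
    """Return the documents sorted by fused score, best first
    (the ranked list of video shots)."""
    return sorted(docs,
                  key=lambda d: combined_score(scorers, weights, query, d),
                  reverse=True)
```

With toy scorers that check word and visterm overlap, a clip sharing both the query's words and visterms outranks one sharing neither.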