Joint Visual-Text Modeling for Multimedia Retrieval
Transcription
Summary of the JHU Workshop 2004, prepared for the USTV Master's course.

Big Picture: The Multimedia Retrieval Task
Example: "Find clips showing Yasser Arafat."
- Process the query text and the query image(s).
- Represent both the query and the documents (video clips) as words and visterms (discrete visual tokens, defined below).
- Joint word-visterm retrieval: retrieve documents using p(Document | Query).
- Output: a ranked list of video shots.

Our Search Task Definition (NIST TRECVID)
- A query is a statement of information need plus examples.
- Query of words: "... Palestinian leader Yasser Arafat today said ..."
- The spoken transcript of a relevant clip may be badly degraded by ASR errors, e.g. "... Yes sir, you're fat today said ..."
- Most research has addressed:
  I. Text queries, text (or degraded text) documents: spoken document retrieval.
  II. Image queries, image data: content-based image retrieval.
- Our approach: combine the scores with joint visual-text models.

Language-Model-Based Retrieval
- A document d consists of words dw and visterms dv; a query q consists of words qw and visterms qv.
- Rank documents with p(qw, qv | dw, dv); equivalently, retrieve documents using p(dw, dv | qw, qv).

Evaluation
- Selection of system parameters may be manual, interactive, or automatic.
- Goal: isolate algorithmic issues from interface and user issues.

Multimedia Retrieval System
- Baseline model.
- Relating document visterms to query words (MT, relevance models, HMMs).
- Relating document words to query images (text classification experiments).
- Visual-only retrieval models.

Presentation Outline
- Translation (MT) models (Paola)
- Relevance models (Shao Lei, Desislava)
- Graphical models (Pavel, Brock)
- Text classification models (Matt)
- Integration and summary (Dietrich)

Experimental Setup: Visual Features
- Region selection: a regular partition of the image, or interest-point neighborhoods (Harris detector) on the greyscale image.
- Feature list: L*a*b moments (COLOR), smoothed edge orientation histogram (EDGE), grey-level co-occurrence matrix (TEXTURE).

Inspiration from Machine Translation
- Machine translation operates on discrete tokens; in our task, one "language" consists of concepts (sun, sky, waves, sea; grass, tiger, ...).
- However, the features extracted from image regions are continuous.
- Solution: vector quantization -> visterms. Mapping each region's feature vector to a codebook entry, {f1, f2, ..., fm} -> vk, gives the discrete representation that creates the analogy to MT.
- MT alignment model: p(f | e) = sum_a p(f, a | e); its image-annotation analogue: p(c | v) = sum_a p(c, a | v).
- Direct translation model: document visterms are translated into concept words, e.g. v10 v22 v35 v43 -> sun, sky, sea, waves; v20 v21 v50 v10 -> tiger, water, grass; v78 v1 -> water, harbor, sky, clouds, sea.

Image Annotation Using Translation Probabilities
- p(c | v): probabilities obtained from the direct translation model.
- Annotate a document dV by averaging over its visterms:
    P0(c | dV) = (1 / |dV|) * sum_{v in dV} P(c | v)
- Annotation results (Corel set), manual vs. predicted: e.g. manual "field, foals, horses, mare, tree" predicted as "horses, foals, mare, field"; manual "people, pool, swimmers, water" predicted as "swimmers, pool, people, water, sky".

Images are defined by spatial context.
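The two steps above, quantizing continuous region features into discrete visterms and annotating a document with the direct-translation average P0(c | dV), can be sketched in plain Python. This is a minimal illustration, not the workshop code: the function names are hypothetical, k-means is used as one standard vector-quantization choice, and the translation probabilities p(c | v) are taken as given (they would come from the trained alignment model).

```python
import random

def build_visterm_codebook(region_features, k, iters=20, seed=0):
    """Vector-quantize continuous region features into k discrete visterms
    using plain k-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(region_features, k)]
    for _ in range(iters):
        # Assign every region feature to its nearest centroid ...
        clusters = [[] for _ in range(k)]
        for f in region_features:
            clusters[nearest(f, centroids)].append(f)
        # ... then move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centroids[j] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids

def nearest(feature, centroids):
    """Index of the closest centroid, i.e. the visterm id v_k for a region."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2
                                 for a, b in zip(feature, centroids[j])))

def annotate(doc_visterms, p_c_given_v, top_n=5):
    """Direct-translation annotation of a document d_V:
        P0(c | d_V) = (1 / |d_V|) * sum_{v in d_V} P(c | v)
    p_c_given_v maps a visterm id to {concept word: probability}."""
    scores = {}
    for v in doc_visterms:
        for c, p in p_c_given_v[v].items():
            scores[c] = scores.get(c, 0.0) + p / len(doc_visterms)
    return sorted(scores.items(), key=lambda cp: -cp[1])[:top_n]
```

In the workshop setting the codebook would be built from the COLOR, EDGE, and TEXTURE features of grid regions or Harris interest-point neighborhoods.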
Cross-Media Relevance Models (CMRM): Intuition
- Isolated pixels have no meaning; context simplifies recognition and retrieval. E.g. a tiger is associated with grass, trees, water, and forest, and is less likely to be associated with computers.
- Two parallel vocabularies: words and visterms. Analogous to cross-lingual relevance models.
- Estimate the joint probabilities of words and visterms from training images. For a concept c and document visterms dv, with T the training set of annotated images J:

    P(c, dv) = sum_{J in T} P(J) * P(c | J) * prod_{i=1..|dv|} P(vi | J)

- J. Jeon, V. Lavrenko and R. Manmatha, "Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models," Proc. SIGIR'03.

Annotation Examples (Corel set)
Top: manual annotations; bottom: predicted words (the top 5 words with the highest probability); red marks correct matches. Examples include manual "jet, plane, sky" predicted as "sky, plane, jet, tree, clouds" and manual "mountain, sky, snow, water" predicted as "sky, mountain, water, clouds, snow".

Proposal: Using Dynamic Information for Video Retrieval

Motivation
- Current models are based on single frames in each shot, but video is dynamic.
- Motion is an important cue for retrieving actions/events, and many TRECVID 2003 queries require motion information, e.g. "find shots of an airplane taking off" and "find shots of a person diving into water."
- Using the optical flow over the entire image does not help; use motion features from objects instead.

Why Use Dynamic Information?
- Model actions/events.
- Better image representations (segmentations): it is much easier to segment moving objects from the background than to segment static images.
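The CMRM joint probability introduced above can be sketched as follows. This is a simplified illustration, not the workshop code: the training-set representation (dicts with "words" and "visterms" lists), the uniform prior P(J), and the Jelinek-Mercer-style smoothing weight are assumptions made for the sketch; the SIGIR'03 paper tunes its own smoothing.

```python
def cmrm_score(concept, doc_visterms, training_set, smoothing=0.1):
    """Cross-Media Relevance Model joint probability
        P(c, d_v) = sum_{J in T} P(J) * P(c | J) * prod_i P(v_i | J)
    estimated from annotated training images. Each training image J is a
    dict with 'words' and 'visterms' lists. Per-image maximum-likelihood
    estimates are smoothed against collection-level frequencies."""
    # Collection-level background distributions for smoothing.
    all_words = [w for J in training_set for w in J["words"]]
    all_vis = [v for J in training_set for v in J["visterms"]]
    bg_w = all_words.count(concept) / len(all_words)

    total = 0.0
    p_J = 1.0 / len(training_set)   # uniform prior over training images
    for J in training_set:
        # Smoothed P(c | J).
        p_c = ((1 - smoothing) * J["words"].count(concept) / len(J["words"])
               + smoothing * bg_w)
        # Smoothed product over the document's visterms, prod_i P(v_i | J).
        p_vs = 1.0
        for v in doc_visterms:
            bg_v = all_vis.count(v) / len(all_vis)
            p_vs *= ((1 - smoothing) * J["visterms"].count(v) / len(J["visterms"])
                     + smoothing * bg_v)
        total += p_J * p_c * p_vs
    return total
```

Ranking concepts by this score for a fixed dv annotates the image; ranking documents by the score of query words retrieves them.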
Segmentation Comparison
- (a) Segmentation using only still-image information vs. (b) segmentation using only motion information. (Pictures from Patrick Bouthemy's website, INRIA.)

Hidden Markov Models for Image Annotation

Inadequacy of the Annotations
- Corel database: annotators often mark only the interesting objects (e.g. "beach, palm, people, tree").
- TRECVID database: annotation concepts capture mostly the semantics of the image and are not very suitable for describing visual properties (e.g. "man-made object, car, transportation, vehicle, outdoors, non-studio setting, nature-non-vegetation, snow").

Alignment Problems
- There is no notion of order in the annotation words.
- Hence difficulties with automatic alignment between words and image regions.

Gradual Training: Results
- Improved alignment of training images.
- Annotation performance on test images did not change significantly.

Joint Segmentation and Labeling
- Example image annotated "tiger, grass, sky": the sky, grass, and tiger regions are segmented and labeled jointly.

Predicting Visual Concepts From Text: Building Features
Pipeline: insert sentence boundaries -> case restoration -> noun extraction -> named entity detection -> WordNet processing -> feature set.

ASR -> Features Example (raw ASR output)
STEVE FOSSETT AND HIS BALLOON SOLO SPIRIT ARSENIDE OVER THE BLACK SEA DRIFTING SLOWLY TOWARDS THE COAST OF THE CAUCUSES HIS TEAM PLANS IF NECESSARY TO BRING HIM DOWN AFTER DAYLIGHT TOMORROW YOU THE CHECHEN CAPITAL OF GROZNY

ASR -> Features Example (sentence boundaries inserted)
STEVE FOSSETT AND HIS BALLOON SOLO SPIRIT ARSENIDE. OVER THE BLACK SEA DRIFTING SLOWLY TOWARDS THE COAST OF THE CAUCUSES. HIS TEAM PLANS IF NECESSARY TO BRING HIM DOWN AFTER DAYLIGHT TOMORROW.
YOU THE CHECHEN CAPITAL OF GROZNY.

ASR -> Features Example (case restoration)
Steve Fossett and his balloon Solo Spirit arsenide. Over the Black Sea drifting slowly towards the coast of the caucuses. His team plans if necessary to bring him down after daylight tomorrow. you the Chechan capital of Grozny.

ASR -> Features Example (extracted features)
- Named entities: Male Person, Location (Region)
- Nouns: balloon, solo, spirit, coast, caucus, team, daylight, Chechan, capital, Grozny
- WordNet: nature

Will This Help for Retrieval?
Example queries with the visual concepts predicted from their text:
- "Find shots of a person diving into some water." -> person, water_body, non-studio_setting, nature_non-vegetation, person_action, indoors
- "Find shots of the front of the White House in the daytime with the fountain running." -> building, outdoors, sky, water_body, cityscape, house, nature_vegetation
- "Find shots of Congressman Mark Souder." -> person, face, indoors, briefing_room_setting, text_overlay

Joint Visual-Text Video OCR

Motivation
Queries where the relevant information appears as text overlaid on the video:
- "Find shots of Congressman Mark Souder."
- "Find shots of a graphic of the Dow Jones Industrial Average showing a rise for one day. The number of points risen that day must be visible."
- "Find shots of the Tomb of the Unknown Soldier in Arlington National Cemetery."
Raw video OCR output can be badly garbled, e.g. "WEIFll I1 NFWdJ TNNIF H".

Why Use Video OCR?
Text overlays provide high-precision information about query-relevant concepts in the current image.

Image Processing
- Preprocessing: normalize the text region's height.
- Feature extraction: color, edge strength and orientation.

Proposal: HMM-Based Recognizer
An HMM whose states c1, c2, ..., c6 are traced over the characters of an overlaid word (M, A, I, T, K on the slide).

CONCLUSION: Summary of Multimodal Queries and Methods
A document has words dw and visterms dv; a query has words qw and visterms qv. The four retrieval directions and the methods studied for each:
- p(qw | dw): language model (LM)
- p(qw | dv): MT, relevance models, HMM
- p(qv | dw): Naïve Bayes, Max. Ent., SVM, AdaBoost, ...
- p(qv | dv): visterm-to-visterm (visual-only) retrieval
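The p(qv | dw) direction in the concluding table, predicting visual concepts from ASR text, can be sketched with a small Naïve Bayes classifier. The feature extraction below is a deliberately crude stand-in for the real pipeline (sentence boundaries, case restoration, noun extraction, named entities, WordNet); all names and the add-one smoothing are illustrative assumptions, not the workshop's implementation.

```python
import math
from collections import Counter, defaultdict

def extract_features(asr_text):
    """Crude stand-in for the feature pipeline: lowercase and split."""
    return asr_text.lower().split()

def train_nb(labeled_docs):
    """labeled_docs: list of (asr_text, set_of_visual_concepts)."""
    word_counts = defaultdict(Counter)   # concept -> word counts
    concept_counts = Counter()           # concept -> number of docs
    vocab = set()
    for text, concepts in labeled_docs:
        words = extract_features(text)
        vocab.update(words)
        for c in concepts:
            concept_counts[c] += 1
            word_counts[c].update(words)
    return word_counts, concept_counts, vocab, len(labeled_docs)

def predict(model, asr_text):
    """Rank concepts by log P(c) + sum_w log P(w | c), add-one smoothed."""
    word_counts, concept_counts, vocab, n_docs = model
    words = extract_features(asr_text)
    scores = {}
    for c, n_c in concept_counts.items():
        total = sum(word_counts[c].values())
        s = math.log(n_c / n_docs)
        for w in words:
            s += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = s
    return sorted(scores, key=scores.get, reverse=True)
```

Maximum entropy, SVMs, or AdaBoost (the other methods in the table) would consume the same feature sets.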
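Finally, the four retrieval directions in the concluding table have to be combined into one document score. A common way to fuse such probabilities is a weighted log-linear combination; the scorer interfaces and weights below are assumptions made for this sketch, with the weights playing the role of the system parameters (manually or automatically selected) discussed under Evaluation.

```python
import math

def combined_score(scorers, weights, query, doc):
    """Log-linear fusion of the retrieval directions:
        p(qw|dw)  language model
        p(qw|dv)  MT / relevance models / HMM
        p(qv|dw)  text classification (Naive Bayes, MaxEnt, SVM, ...)
        p(qv|dv)  visual-only retrieval
    `scorers` maps a direction name to a function (query, doc) -> probability;
    `weights` maps the same names to fusion weights."""
    return sum(weights[name] * math.log(max(scorers[name](query, doc), 1e-12))
               for name in scorers)

def rank_documents(scorers, weights, query, docs):
    """Return the documents sorted by fused score, best first
    (the ranked list of video shots)."""
    return sorted(docs,
                  key=lambda d: combined_score(scorers, weights, query, d),
                  reverse=True)
```

With toy scorers that check word and visterm overlap, a clip sharing both the query's words and visterms outranks one sharing neither.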