Ike Antkare - Com`Eau Labo

Ike Antkare: Genesis and Echoes
Cyril Labbé
Université Joseph Fourier - LIG
October 2014
C.Labbé (UJF-LIG)
Ike Antkare & Co
1 Preliminaries
    Scientometrics
    SCIgen a Probabilistic Context Free Grammar
2 Ike Antkare, one of the great stars in the scientific firmament
3 Detection of SCIgen papers : June 2012
    Google Search
    Automatic classification
Preliminaries
Scientometrics
Ranking scientists and journals
Number of citations
Definition (h-index [Hirsch, 2005])
A scientist has index h if h of his or her Np papers have at least h citations each and the other (Np - h) papers have ≤ h citations each.
[Figure: citations versus papers (0 to Np), with h marked]
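The definition above can be turned into a few lines of Python. This is a minimal sketch: the function name `h_index` and the sample citation counts are illustrative assumptions, not taken from the slides.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the first `rank` papers all have >= rank citations
        else:
            break         # every later paper has fewer citations than its rank
    return h


if __name__ == "__main__":
    # Hypothetical citation counts for one author's five papers.
    print(h_index([10, 8, 5, 4, 3]))
```

Sorting once and scanning by rank mirrors the usual picture of the h-index: the point where the decreasing citation curve crosses the diagonal.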
Definition (Impact Factor)
Average number of citations to papers published by the journal over the last two years. Computed since 1975.
[Figure: time after publication, with a 2-year window marked]
Tools that count citations.
Fee-based tools:
  Provided by publishers (Elsevier, Thomson Reuters);
  Based on publishers' catalogs (ACM, IEEE, Springer, Elsevier, ...);
  Selected venues only ( all peer reviewed).
Free tools:
  Google Scholar, CiteSeerX, ...
  Crawling the web and/or selected catalogs and/or added by users;
  Social media (Google+, Scholarometer, Microsoft Academics, ...).
Free tools that compute indicators:
  Publish or Perish; Scholarometer; Microsoft Academics; Google+; and many more...
Chronos
[Timeline, 2004 to 2015: Web of Science (Thomson Reuters), Scopus (Elsevier), Google Scholar, h-index, PoP V1.0, Abiteboul by the administrator of the Collège de France]
Tools to generate publications.
SCIgen a Probabilistic Context Free Grammar
PCFG: Probabilistic Context Free Grammar
Sets of symbols
  Set of non-terminal symbols N = {SP, S, V, P},
  Set of terminal symbols Σ = {".", sing, dance, flight, seas, oceans, air, streets, hills, fields}.
Set of rules Ri
  R1:      SP → S.                                                  p(R1) = 1
  R2:      S  → We shall V in the P                                 p(R2) = 1/4
  R3:      S  → We shall V in the P, S                              p(R3) = 1/2
  R4:      S  → We shall V in the P and in the P, S                 p(R4) = 1/4
  R5..7:   V  → sing | dance | flight                               p(Ri) = 1/3, i = 5..7
  R8..13:  P  → seas | oceans | air | streets | hills | fields      p(Ri) = 1/6, i = 8..13
Terminal string example:
  s: We shall sing in the air and in the hills, We shall dance in the fields.
  p(s) = \prod_j p(R_j)
[Slide overlay on p(R2), p(R3), p(R4): "Non zero probability to 1"]
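The toy grammar can be exercised with a short Python sketch that samples a sentence and multiplies the probabilities of the rules it used. The names `RULES`, `generate`, `detokenize` and `product` are illustrative assumptions; multi-word fragments such as "We shall" are treated as single terminal tokens for brevity, which the slide does not do explicitly.

```python
import random
from fractions import Fraction

# Toy grammar from the slide: non-terminal -> list of (probability, right-hand side).
# Multi-word fragments such as "We shall" are kept as single tokens (an assumption).
RULES = {
    "SP": [(Fraction(1), ["S", "."])],                                   # R1
    "S": [
        (Fraction(1, 4), ["We shall", "V", "in the", "P"]),              # R2
        (Fraction(1, 2), ["We shall", "V", "in the", "P", ",", "S"]),    # R3
        (Fraction(1, 4), ["We shall", "V", "in the", "P",
                          "and in the", "P", ",", "S"]),                 # R4
    ],
    "V": [(Fraction(1, 3), [w]) for w in ("sing", "dance", "flight")],   # R5..7
    "P": [(Fraction(1, 6), [w]) for w in
          ("seas", "oceans", "air", "streets", "hills", "fields")],      # R8..13
}


def generate(symbol="SP", rng=random):
    """Expand `symbol` at random; return (tokens, probability of the derivation)."""
    if symbol not in RULES:  # terminal symbol: emitted unchanged
        return [symbol], Fraction(1)
    options = RULES[symbol]
    prob, rhs = rng.choices(options, weights=[float(p) for p, _ in options], k=1)[0]
    tokens = []
    for part in rhs:
        sub_tokens, sub_prob = generate(part, rng)
        tokens.extend(sub_tokens)
        prob *= sub_prob
    return tokens, prob


def detokenize(tokens):
    """Join tokens and remove the space before punctuation."""
    return " ".join(tokens).replace(" ,", ",").replace(" .", ".")


def product(probabilities):
    """p(s) = product over j of p(R_j) for the rules used in one derivation."""
    result = Fraction(1)
    for p in probabilities:
        result *= p
    return result


# Derivation of the slide's example string:
# R1, R4, V->sing, P->air, P->hills, R2, V->dance, P->fields
EXAMPLE_RULE_PROBS = [Fraction(1), Fraction(1, 4), Fraction(1, 3), Fraction(1, 6),
                      Fraction(1, 6), Fraction(1, 4), Fraction(1, 3), Fraction(1, 6)]

if __name__ == "__main__":
    tokens, prob = generate(rng=random.Random(0))
    print(detokenize(tokens), prob)
    print(product(EXAMPLE_RULE_PROBS))  # probability of the example derivation
```

For the example string on the slide, the product of the eight rule probabilities listed in `EXAMPLE_RULE_PROBS` is 1/31104. Exact fractions are used so that the probabilities of each non-terminal can be checked to sum to 1.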
SCIgen: 2005 by J. Stribling, M. Krohn & D. Aguayo
"... maximize amusement, rather than coherence ..."
Paper structure:
  Title
  Abstract
  Introduction: Intro_A, Intro_A2, Intro_A3, Intro_closing
  Model
  Impl
  Eval
  RelatedWork
  Concl
  References
Rules:
  Intro_A → Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
  Intro_A → SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until...
  Intro_A → The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends...
  ...
  Intro_A → In recent years, much research has been devoted to the SCI_ACT; ...
  Intro_A → The SCI_ACT is a SCI_ADJ SCI_PROBLEM.
  Intro_A → The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have...
  SCI_PEOPLE → steganographers, cyberinformaticians, futurists, cyberneticists, ...
  SCI_BUZZWORD_ADJ → omniscient, introspective, peer-to-peer, ambimorphic, ...
[Figure: SCIgen example]
Chronos
[Timeline, 2004 to 2015: Web of Science (Thomson Reuters), Scopus (Elsevier), Google Scholar, SCIgen, h-index, PoP V1.0, Abiteboul by the administrator of the Collège de France]
Ike Antkare, one of the great stars in the scientific firmament
SCIgen texts citing SCIgen texts [Labbé, 2010]
[Figure: Modified SCIgen output, documents numbered 0, 1, ..., 100 (Ike Antkare's 101 Documents), and Real Documents]
[Figure: Ike Antkare's h-index according to Google Scholar (2010)]
Chronos
[Timeline, 2004 to 2015: Web of Science (Thomson Reuters), Scopus (Elsevier), Google Scholar, SCIgen, h-index, PoP V1.0, Ike Antkare, Abiteboul by the administrator of the Collège de France]
Get cited or Perish
Conclusion
                            Completeness   Accuracy      Robustness
Google Scholar (free)       Good           Good enough   Spammable
WoK / Scopus (fee-based)    Incomplete     Error free    Excellent
A scholar/scientist would never commit fraud like that...
Detection of SCIgen papers : June 2012
Google Search
Phrase search and More Like This (MLT)
IEEE http://www.computer.org
  Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
  In recent years, much research has been devoted to the SCI_ACT; ...
  SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...
  The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...
  The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...

Corpus name   Downloaded from   Years       Type of papers   Number of papers   Acceptance rate   Corpus size
MLT           IEEE ieee.org     2008–2010   various          122                NA                122
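One way to picture phrase search on these templates is to keep the fixed wording and let each SCI_* slot match a few arbitrary words. The sketch below is only an illustration of that idea with Python's `re` module, not the search actually run on the IEEE site; the helper names, the slot width of one to four words, and the test sentence (including the filler "online algorithms") are assumptions.

```python
import re

SLOT = re.compile(r"SCI_[A-Z_]+")
# Assumption: a slot expands to between one and four whitespace-separated words.
SLOT_PATTERN = r"\S+(?: \S+){0,3}"

# Fixed part of one template shown on the slide.
TEMPLATE = "Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN,"


def template_to_regex(template):
    """Escape the literal text of a template and replace each SCI_* slot by a wildcard."""
    literal_parts = [re.escape(part) for part in SLOT.split(template)]
    return re.compile(SLOT_PATTERN.join(literal_parts))


def looks_generated(sentence, template=TEMPLATE):
    """True if the sentence contains the fixed wording of the template."""
    return template_to_regex(template).search(sentence) is not None


if __name__ == "__main__":
    # Made-up sentence in the style of the template.
    sample = ("Many steganographers would agree that, had it not been for "
              "online algorithms, the study of caches might never have occurred.")
    print(looks_generated(sample))
```

A match only shows that a sentence follows the template's wording; a real document would need several such hits before being treated as a candidate.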
Automatic classification
Intertextual Distance [Labbé and Labbé, 2006]
Relative frequencies over the vocabulary (chat, le, un):
  A: {le le chat}     (1/3, 2/3, 0/3)
  B: {un chat chat}   (2/3, 0/3, 1/3)
[Figure: the texts A and B plotted on the axes un, chat, le]
Intertextual Distance: D(A,B) = \frac{1}{2} \sum_{i \in (A \cup B)} |f_{i,A} - f_{i,B}| = \frac{2}{3}
Interpretation:
D(A,B) = the proportion of word tokens that are different in the two texts.
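The formula can be checked on the two three-word texts with a small Python sketch. It follows only the relative-frequency formula shown on the slide; the function name `intertextual_distance` is an assumption, and both texts are assumed to be non-empty.

```python
from collections import Counter
from fractions import Fraction


def intertextual_distance(tokens_a, tokens_b):
    """D(A,B) = 1/2 * sum over the joint vocabulary of |f_{i,A} - f_{i,B}|.

    f_{i,X} is the relative frequency of word i in text X.
    Both token lists must be non-empty.
    """
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    vocabulary = set(count_a) | set(count_b)
    total = sum(abs(Fraction(count_a[w], n_a) - Fraction(count_b[w], n_b))
                for w in vocabulary)
    return total / 2


if __name__ == "__main__":
    a = "le le chat".split()
    b = "un chat chat".split()
    print(intertextual_distance(a, b))  # the slide's example gives 2/3
```

With relative frequencies the distance is 0 for identical texts and 1 for texts that share no word, which matches the interpretation as a proportion of differing tokens.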
SCIgen Detection: proposed method
http://scigendetection.imag.fr

Corpus           Downloaded        Years   Field              Corpus size
arXiv¹           arxiv.org         08–10   Computer Science   15338
MLT              ieee.org          08–10   Computer Science   122
SCIgen-Origin    Original SCIgen   –       Computer Science   236
SCIgen-Physics   Modified SCIgen   –       Physics            414

Let t be a text under test.
Let D_t^{Fake} be the distance between t and the nearest fake.
If D_t^{Fake} < 0.55
Then SCIgen origin must be seriously considered (misclass. risk < 10^{-5}).
Else (D_t^{Fake} > 0.55) non-SCIgen origin must be seriously considered.

¹ open repository for scholarly papers
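The decision rule amounts to a nearest-neighbour test against a corpus of known fakes. The sketch below assumes a relative-frequency distance like the one on the previous slide and uses tiny made-up token lists in place of real corpora; `distance`, `classify` and the labels are illustrative names, and only the 0.55 threshold comes from the slide.

```python
from collections import Counter

THRESHOLD = 0.55  # value given on the slide


def distance(tokens_a, tokens_b):
    """Half the L1 distance between relative word frequencies (non-empty texts)."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    vocabulary = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a[w] / len(tokens_a) - freq_b[w] / len(tokens_b))
                     for w in vocabulary)


def classify(text_tokens, fake_corpus, threshold=THRESHOLD):
    """Return (label, distance to the nearest fake) for a text under test.

    The slide leaves the case distance == threshold unspecified;
    this sketch treats it as non-SCIgen.
    """
    nearest = min(distance(text_tokens, fake) for fake in fake_corpus)
    label = "SCIgen" if nearest < threshold else "non-SCIgen"
    return label, nearest


if __name__ == "__main__":
    # Hypothetical stand-ins for a corpus of known generated texts.
    fakes = ["we shall sing in the air".split(),
             "we shall dance in the fields".split()]
    print(classify("we shall sing in the hills".split(), fakes))
    print(classify("completely different words here".split(), fakes))
```

Only the minimum distance matters, so the fake corpus can mix outputs of several generators, as in the SCIgen-Origin and SCIgen-Physics corpora listed above.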
[Figure: dendrogram of Corpus Z, MLT and SCIgen texts; vertical axis: Distance, from 0.0 to 0.7]
Scopus, WoK, ...
[Timeline, 2004 to 2015: Web of Science (Thomson Reuters), Scopus (Elsevier), Google Scholar, SCIgen, h-index, PoP V1.0, Ike Antkare, Nature, Abiteboul by the administrator of the Collège de France]
Related/Ongoing Work
Spoofing [Beel and Gipp, 2010, Lopez-Cozar et al., 2012], academic search engine optimization [Beel et al., 2010];
Detection methods: bibliography-based [Xiong and Huang, 2009], compression [Dalkilic et al., 2006], ad-hoc distance [Lavoie and Krishnamoorthy, 2010], phrase search [Springer, 2014].
No SCIgen paper in arXiv (Computer Science).
Image borrowed from [Ginsparg, 2014]; PCA, only stop-words. Supposed non Zipfian.
Conclusion and Future/Ongoing works
Publication procedures, models and habits
  Why fake papers were accepted, published and ... sold.
  Traditional publishers vs open access.
  Knowledge diffusion: better and less... or as much as possible.
  Automatic knowledge extraction/detection/generation.
Blind management rules...
  ... are an incentive for malpractice: slicing, plagiarism, faked data, ...
Automatic detection of new generators
  Hand-written PCFG: find dense clusters inside a population.
  Study other kinds of generators (language models).
On the web today
  How to separate the wheat from the chaff... and scale up!
Thanks
Beel, J. and Gipp, B. (2010). Academic search engine spam and Google Scholar's resilience against it. Journal of Electronic Publishing, 13(3).
Beel, J., Gipp, B., and Wilde, E. (2010). Academic search engine optimization (ASEO). Journal of Scholarly Publishing, 41(2):176–190.
Dalkilic, M. M., Clark, W. T., Costello, J. C., and Radivojac, P. (2006). Using compression to identify classes of inauthentic texts. In Proceedings of the 2006 SIAM Conference on Data Mining.
Ginsparg, P. (2014). Automated screening: ArXiv screens spot fake papers. Nature, 508(7494):44–44.
Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Science, 102:16569–16572.
Labbé, C. (2010). Ike Antkare, one of the great stars in the scientific firmament. International Society for Scientometrics and Informetrics Newsletter, 6(2):48–52.
Labbé, C. and Labbé, D. (2006). A tool for literary studies: intertextual distance and tree classification. Literary and Linguistic Computing, 21(3):311–326.
Labbé, C. and Labbé, D. (2013). Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics, 94(1):379–396.
Lavoie, A. and Krishnamoorthy, M. (2010). Algorithmic Detection of Computer Generated Text. ArXiv e-prints.
Lopez-Cozar, E. D., Robinson-García, N., and Torres-Salinas, D. (2012). Manipulating Google Scholar Citations and Google Scholar Metrics: Simple, easy and tempting. arXiv preprint arXiv:1212.0638.
Xiong, J. and Huang, T. (2009). An effective method to identify machine automatically generated paper. In Knowledge Engineering and Software Engineering, 2009. KESE '09. Pacific-Asia Conference on, pages 101–102.
