JOURNAL OF CHINESE INFORMATION PROCESSING
Vol. 29, No. 1, Jan. 2015
Article ID: 1003-0077(2015)01-0057-10
CLC number: TP391    Document code: A

Construction of Information Extraction-orientated Chinese Cross Document Coreference Corpus

ZHAO Zhiwei, QIAN Longhua, ZHOU Guodong
(1. Natural Language Processing Laboratory, Soochow University, Suzhou, Jiangsu 215006, China;
2. School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China)

Abstract: Cross Document Coreference (CDC) resolution is an important step in information integration and information fusion. As a consequence, a CDC corpus is indispensable for research on and evaluation of CDC resolution. Given that no Chinese CDC corpus oriented for information extraction is publicly available, this paper describes how to build a CDC corpus based on the ACE2005 Chinese corpus via automatic generation and manual annotation, covering all the ACE entity types. The corpus is made publicly available to advance research on Chinese CDC resolution. In addition, this paper analyses the types and characteristics of CDC in Chinese text and proposes two metrics, "variation perplexity" and "ambiguity perplexity", to evaluate the difficulty of Chinese CDC resolution, providing some insights for further CDC research.

Keywords: cross document coreference; information extraction; corpora annotation; perplexity

Received: 2012-04-09; revised: 2012-08-06
Grant numbers: (60873150, 90920004) (BK2010219, 11KJA520003)

1 [2] [3] MUC-6 [4] (Cross Document Co-reference, CDC) [1] ACE2008 [5] (Global Entity Detection and Recognition, GEDR) CDC JHU ACE2005 CDC ACE2005 IBM CDC ACE2005 CDC 1 ACE2005 2 [3, 5-8] CDC CDC [6-7, 9-10] NIST ACE2008 GEDR CDC 1 1 CDC John Smiths [6] WePS2007 [9] CIPS2010 [10] ACE2008 [5] 197 1 3477 49 2968 30 6932 32 5604 26 10 000 Wikipedia 4.3 WEB LinkCorpus [11] 150 Person-X [7] 40000 1 zip http://nlp.suda.edu.cn/~qianlonghua/ace2005-cdc.
ACE2005 3660 CDC [12] 7129 633 ACE2005 3816 CDC 6771 CDC WEB (ORG) a) web WEB [6] John Smiths [11] WikiLink WePS [9] CIPS [10] ACE2008 JHU 2007 CLSP ACE2005 CDC Person-X [7] ACE2005 Person-X CDC Singh [11] ACE2008 CDC CDC PER ORG GPE CDC 2419 CDC ACE2008 GEDR GRDR (Global Relation Detection and Recognition) 10000 [12] ACE2005 CDC ACE2005 CDC 85.4% b) CDC CDC CDC NIST ACE2005 TAC (Text Analysis Conference) CDC ACE2008 KBP (Knowledge Base Population) [13] (PER) (Entity Linking)
(Entity Disambiguation) [14] ID 3.2 (Infobox) 3.3 [15-16] Web [9] Web [14] A B Web 3 ACE2005 1 (headword) 3.1 ACE XML ACE XML XML C 1 CDC
4.1 a) Passonneau [17] Krippendorff [18] alpha Passonneau b) 0 GPE 0.33 0.67 1 3.4 2 4.2 CDC ID ENT TYPE 4.2.1 ENTITY DOC 3618 6771 ID MENTION 4 CDC alpha 96% Krippendorff 67% alpha

Table 2 CDC

Entity type   Count
FAC             299
GPE            2419
LOC             233
ORG            1939
PER            1860
VEH              17
WEA               4
Total          6771

2795 2795 41.3% 3976 643 3 GPE 85% 55% ORG PER LOC FAC GPE 4.2.2
3 CDC 643 4.2.3 255 CDC 40% 7% 3795 3795 1 666 GPE Banda Aceh ACE ORG DPRK (Democratic People's Republic of Korea) PER 2 ACE ACE 2 3 /% 61.2 14.5 15.3 9.0 70 GPE
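4.3 Popescu [19]

4.3.1 Variation perplexity ($PP_v$)

For an entity $E_i$, let $m_j$ be its $j$-th distinct name and $c_j$ the number of times $m_j$ refers to $E_i$:

$$\Psi(E_i) = \{m_j\}_{j=1}^{|E_i|}, \qquad p_j = \frac{c_j}{\sum_{k=1}^{|E_i|} c_k}$$

$$H_v(E_i) = -\sum_{j=1}^{|E_i|} p_j \log(p_j) \tag{1}$$

$$H_v(C) = \sum_{i=1}^{C_e} w_i^e \, H_v(E_i) \tag{2}$$

$$PP_v = 2^{H_v(C)} \tag{3}$$

Here $H_v(E_i)$ is the entropy of entity $E_i$, $H_v(C)$ is that of the corpus $C$, $w_i^e$ is the weight of $E_i$, and $C_e$ is the number of entities in $C$.

100% $PP_v$ 1.19 2795 3 1.36 /% 7.1 90.0 2.9 1.19 1.36

4.3.2 Ambiguity perplexity ($PP_r$)

Popescu $M_i$ CDC ($PP_r$) $C_m$ (1)~(3) 1.06 1.10 1.06 1.10

4.4 B3 [20] ($PP_r$) ($PP_v$)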
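As a concrete reading of Eqs. (1)~(3) in Section 4.3.1, the following minimal sketch computes a variation perplexity from per-entity name counts. It rests on assumptions that the surviving text does not confirm: the logarithm is taken base 2 (to match the $2^{H}$ in Eq. (3)), and the weight $w_i^e$ defaults to the entity's share of all mentions. The function names `entity_entropy` and `variation_perplexity` and the input layout are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def entity_entropy(name_counts):
    """Eq. (1): entropy of one entity over its distinct names.

    name_counts maps each name m_j to its count c_j.
    Assumption: base-2 logarithm, so that Eq. (3) yields a perplexity.
    """
    total = sum(name_counts.values())
    entropy = 0.0
    for count in name_counts.values():
        if count > 0:
            p = count / total  # p_j = c_j / sum_k c_k
            entropy -= p * math.log2(p)
    return entropy


def variation_perplexity(entities, weights=None):
    """Eqs. (2)-(3): weighted corpus entropy, then 2 ** entropy.

    entities maps an entity id to a Counter of its names.
    Assumption: when no weights are given, w_i is the entity's share
    of all mentions; the paper's own weighting may differ.
    """
    if weights is None:
        sizes = {e: sum(c.values()) for e, c in entities.items()}
        grand_total = sum(sizes.values())
        weights = {e: s / grand_total for e, s in sizes.items()}
    corpus_entropy = sum(weights[e] * entity_entropy(c)
                         for e, c in entities.items())
    return 2 ** corpus_entropy


if __name__ == "__main__":
    toy = {
        "E1": Counter({"name_a": 2, "name_b": 2}),  # two equally used names
        "E2": Counter({"name_c": 4}),               # a single name
    }
    print(variation_perplexity(toy))
```

A value of 1 means every entity is always referred to by a single name; larger values indicate more name variation per entity. Passing an explicit `weights` dictionary allows a different weighting scheme to be tried without changing the code.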
B3 TF*IDF 5 B3 ACE2005 4 4 CDC F1 0.909 0.735 0.813 ACE 0.963 0.820 0.886 0.633 0.425 0.508 GPE ORG PER CDC [3, 21] CDC ACE2005 CDC CDC

[1] Dagan I, Itai A. Automatic Processing of Large Corpora for the Resolution of Anaphora References[C]//Proceedings of the 13th Conference on Computational Linguistics. Stroudsburg, PA, USA: Hans Karlgren, 1990: 330-332.
[2] . [D]. 2000.
[3] Mayfield J, Alexander D, Dorr B, et al. Cross-Document Coreference Resolution: A Key Technology for Learning by Reading[C]//Proceedings of the AAAI 2009 Spring Symposium on Learning by Reading and Learning to Read. Stanford, California, March 23, 2009: 65-70.
[4] Grishman R, Beth S. Message Understanding Conference-6: A Brief History[C]//Proceedings of the 16th Conference on Computational Linguistics (COLING'96). Copenhagen, Denmark, August 1996, 05-09.
[5] NIST Speech Group. The ACE 2008 Evaluation Plan: Assessment of Detection and Recognition of Entities and Relations within and across Documents[EB/OL]. http://www.nist.gov/speech/tests/ace/2008/doc/ace08-evalplan.v1.2d.pdf, 2008.
[6] Bagga A, Baldwin B. Entity-Based Cross-Document Coreferencing Using the Vector Space Model[C]//Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL'98). Montréal, Québec, Canada, 1998: 79-85.
[7] Gooi C H, Allan J. Cross-Document Coreference on a Large Scale Corpus[C]//Proceedings of HLT-NAACL 2004. USA, 2004: 9-16.
[8] Batista-Navarro R T, Ananiadou S. Building a Coreference-annotated Corpus from the Domain of Biochemistry[C]//Proceedings of the 2011 Workshop on Biomedical Natural Language Processing, ACL-HLT 2011. Portland, Oregon, USA, June 23-24, 2011: 83-91.
[9] Artiles J, Gonzalo J, Sekine S. Web People Search Task at SemEval-2007[EB/OL]. http://nlp.uned.es/weps/weps2007 data readme 1.1.txt, 2007.
[10] CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2010)[EB/OL]. http://www.cipsc.org.cn/clp2010/task3 ch.htm, 2010.
[11] Singh S, Subramanya A, Pereira F, et al. Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, 2011: 793-803.
[12] CLSP Summer Workshop. Exploiting Lexical & Encyclopedic Resources for Entity Disambiguation[EB/OL]. http://www.clsp.jhu.edu/ws2007/groups/elerfed/documents/elerfed-cdc-overview.v2.ppt, 2007.
[13] Task Description for Knowledge Base Population at TAC 2009[EB/OL]. http://apl.jhu.edu/~paulmac/kbp/090601-kbptaskguidelines.pdf, 2009.
[14] . [J]. 2011, 25(6): 98-110.
[15] Rao D, McNamee P, Dredze M. Entity Linking: Finding Extracted Entities in a Knowledge Base[M]//Multi-source, Multi-lingual Information Extraction and Summarization. Germany: Springer, 2011.
[16] Cucerzan S. Large-Scale Named Entity Disambiguation Based on Wikipedia Data[C]//Proceedings of Empirical Methods in Natural Language Processing. Prague, June 28-30, 2007: 708-716.
[17] Passonneau R J. Computing Reliability for Coreference Annotation[C]//Proceedings of the International Conference on Language Resources (LREC). Lisbon, Portugal, May 2004.
[18] Krippendorff K H. Content Analysis: An Introduction to its Methodology[M]. Beverly Hills, CA: Sage Publications, 1980.
[19] Popescu O. Person Cross Document Coreference with Name Perplexity Estimates[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore, 6-7 August 2009: 997-1006.
[20] Bagga A. Evaluation of Coreferences and Coreference Resolution Systems[C]//Proceedings of the First Language Resource and Evaluation Conference. Granada, Spain, May 1998.
[21] Baron A, Freedman M. Who is Who and What is What: Experiments in Cross-Document Co-Reference[C]//Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Honolulu, October 2008: 274-283.

ZHAO Zhiwei (1987-), E-mail: none.zhao@gmail.com
QIAN Longhua (1966-), E-mail: qianlonghua@suda.edu.cn
ZHOU Guodong (1967-), E-mail: gdzhou@suda.edu.cn