Abstract
In this article, we present a new type of classification problem, which we call the Comparative Classification Problem (CCP), where we use the term data record to refer to a block of instances. Given a single data record with n instances for n classes, the CCP is to map each instance to a unique class. This problem occurs in a wide range of applications where the independent and identically distributed (iid) assumption breaks down. The primary difference between CCP and classical classification is that in the latter, the assignment of a class (in our application, a translator) to one instance is independent of the assignment of a class to any other instance. In CCP, however, assigning a translator to one instance within a block excludes that translator from being assigned to any other instance in that block. This interdependency in the data poses challenges for techniques that rely on the iid assumption.
In the Pairwise CCP (PWCCP), each block contains a pair of instances. The key difference between PWCCP and classical binary classification is that hidden patterns can only be unmasked by comparing the instances as pairs. In this article, we introduce a new algorithm, PWC4.5, which is based on C4.5, to handle PWCCP. We first show that a simple transformation, which we call the Gradient-Based Transformation (GBT), can address the violation of the iid assumption in C4.5. We then evaluate PWC4.5 on two real-world corpora to distinguish between translators of Arabic-English and French-English translations. While traditional C4.5 failed to distinguish between different translators, GBT performed better, and PWC4.5 consistently outperformed both C4.5 and GBT.
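The uniqueness constraint can be illustrated with a small sketch. It assumes that some model has already produced a score for every instance-class combination in one block; the score values, the function names, and the brute-force search are all illustrative assumptions and are not taken from the article. The sketch contrasts choosing a class for each instance independently with choosing the best one-to-one mapping for the whole block.

```python
from itertools import permutations


def independent_assignment(scores):
    """Pick the highest-scoring class for each instance on its own (iid-style)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def block_assignment(scores):
    """Pick the one-to-one instance-to-class mapping with the highest total score.

    Brute force over all permutations, so only suitable for small blocks.
    """
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Hypothetical scores for one block: rows are instances, columns are classes.
scores = [
    [0.6, 0.3, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.3, 0.5],
]

print(independent_assignment(scores))  # the same class can be chosen twice
print(block_assignment(scores))        # every class is used exactly once
```

In this toy block, the independent rule gives class 0 to both of the first two instances, which the CCP constraint forbids, while the joint rule returns a permutation of the classes.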
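The abstract does not define GBT, so the following is not that transformation. It is a generic pairwise-difference encoding, offered only to illustrate why comparing the two instances of a pair can expose a pattern that the raw feature values hide. The feature values, function names, and labeling scheme are hypothetical.

```python
def pair_to_difference(first, second):
    """Encode a pair of feature vectors as the element-wise difference first - second."""
    return [a - b for a, b in zip(first, second)]


def expand_pair(x_a, x_b):
    """Return both orderings of a pair with a relative label.

    Label 1 means the first vector came from class A; label 0 means the reverse.
    """
    return [
        (pair_to_difference(x_a, x_b), 1),
        (pair_to_difference(x_b, x_a), 0),
    ]


# Hypothetical blocks: (features of the class-A instance, features of the class-B instance).
# The raw first feature interleaves across blocks (A: 5, 12, 2; B: 3, 9, 1),
# so no single threshold on it separates the two classes.
blocks = [
    ([5, 2], [3, 2]),
    ([12, 7], [9, 8]),
    ([2, 1], [1, 1]),
]

training_rows = [row for a, b in blocks for row in expand_pair(a, b)]
for features, label in training_rows:
    print(features, label)
```

After the encoding, the sign of the first difference alone separates the two labels in this toy data, which a decision-tree learner could capture with a single split.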
Pairwise Comparative Classification for Translator Stylometric Analysis