Stylochronometry: Timeline Prediction in Stylometric Analysis

Klaussner, Carmen; Vogel, Carl

doi:10.1007/978-3-319-25032-8_6

Included in the following conference series:

International Conference on Innovative Techniques and Applications of Artificial Intelligence

Abstract

We examine stylochronometry, the question of measuring change in linguistic style over time within an authorial canon and in relation to change in language in general use over a contemporaneous period. We take the works of two prolific authors from the 19th/20th century, Henry James and Mark Twain, and identify variables that change for them over time. We present a method of analysis that applies regression on linguistic variables to predict a temporal variable. In order to identify individual authors’ effects on the model, we compare the model based on the novelists’ works to a model based on a 19th/20th century American English reference set. We evaluate using \(R^2\) and root mean square error (RMSE), which indicates the average error in predicting the year. On the two-author data, we achieve an RMSE of \(\pm \)7.2 years on unseen data (baseline: \(\pm \)13.2); for the larger reference set, our model obtains an RMSE of \(\pm \)4 years on unseen data (baseline: \(\pm \)17).
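The evaluation setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline (the references suggest they worked in R): the data below are synthetic, the two features stand in for hypothetical relative frequencies, ordinary least squares stands in for the regression model, and the baseline is assumed to be a predictor that always outputs the mean training year. Predictions are rounded to whole years before scoring, as the notes say was done for the reported figures.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error: sqrt(mean((y_hat - y)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; this can be negative on held-out data."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept column."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply the fitted coefficients (intercept first) to new rows."""
    design = np.column_stack([np.ones(len(X)), X])
    return design @ coef


rng = np.random.default_rng(0)


def synthetic_features(years):
    """Hypothetical relative frequencies that drift linearly with the year."""
    t = (years - 1870.0) / 40.0
    f1 = 0.020 + 0.010 * t + rng.normal(0.0, 0.001, size=len(years))
    f2 = 0.050 - 0.020 * t + rng.normal(0.0, 0.002, size=len(years))
    return np.column_stack([f1, f2])


# Synthetic "publication years" for training and unseen texts (assumption).
train_years = rng.integers(1870, 1911, size=40).astype(float)
test_years = rng.integers(1870, 1911, size=15).astype(float)
X_train = synthetic_features(train_years)
X_test = synthetic_features(test_years)

coef = fit_ols(X_train, train_years)
model_pred = np.rint(predict(coef, X_test))  # round to whole years
baseline_pred = np.full_like(test_years, np.rint(train_years.mean()))

model_rmse = rmse(test_years, model_pred)
baseline_rmse = rmse(test_years, baseline_pred)
print(f"model:    RMSE={model_rmse:.1f}  R2={r_squared(test_years, model_pred):.2f}")
print(f"baseline: RMSE={baseline_rmse:.1f}  R2={r_squared(test_years, baseline_pred):.2f}")
```

A constant predictor can never score above zero on this definition of \(R^2\), which is why comparing RMSE against such a baseline is a useful sanity check on unseen data.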

Notes

1.
See Coleman [6], Frontini et al. [9] for discussion of attempts in the 15th century to date a text purported to be from the 3rd but shown to be most likely from (circa) the 8th. The former depends on manual methods and the latter, semi-automatic methods.
2.
Early-Wittgenstein may be stylistically as well as conceptually distinct from Late-Wittgenstein.
3.
ANalysis Of VAriance (ANOVA) is a collection of methods developed by R.A. Fisher to analyze differences within and between different groups. Principal Component Analysis (PCA) is an unsupervised statistical technique to convert a set of possibly related variables to a new uncorrelated representation or principal components.
4.
The coefficient of determination \(R^2\) indicates how well a model fits the observed data. It ranges from 0 to 1, with 0 indicating a poor fit and 1 a perfect one. When predictions are evaluated against held-out outcomes (test set), values can also range from \(-1\) to 1; negative values mean that the mean of the data provides a better fit than the model.
5.
Distinctiveness Ratio: Measure of variability defined by the rate of occurrence of a word in a text divided by its rate of occurrence in another.
6.
Here, we only include the main works/novels for reasons of text length and genre homogeneity.
7.
http://www.gutenberg.org/ - last verified August 2015.
8.
https://archive.org/ - last verified August 2015.
9.
The separate entries are created using the POS tags assigned by the tagger to the individual word entity in its context.
10.
This is without loss of generality to the bag-of-words analysis of texts in which sentence structures are not used subsequent to POS tagging.
11.
Lexical features are continuous here because we use relative frequencies.
12.
This applies if it is meaningful to count instances of the variable, as it is for token n-grams: such relativization does not apply, for example, to average word lengths.
13.
This is not to argue that complementary categories (e.g. relativized counts of features that are not shared between both authors over the entire duration or features that are never shared by the authors over the duration, etc.) are uninteresting. However, for this work we are addressing change in language shared by the two authors and relative to change in background language of their time, thinking that this provides an interesting perspective on their distinctiveness from each other and everyone else.
14.
The t-value measures the size of the difference between an observed sample statistic and its hypothesized population parameter relative to the variation in the sample data. The further the t-value falls into either tail of the t-distribution, the greater the evidence against the null hypothesis that there is no significant difference between the hypothesized and the observed value.
15.
In this case, the Akaike information criterion (AIC) is used to evaluate the model: \(AIC = -2 \log L + 2k\), where \(L\) is the likelihood and \(k\) the number of estimated parameters in the model. Thus, AIC rewards goodness-of-fit, but penalizes the number of parameters in the model.
16.
All models reported here had reliable \(\bar{R}^2\) values, with associated p-values \({<}0.0001\), so we dispense with reporting this in each individual case.
17.
\(RMSE = \sqrt{\frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{n}}\), where \(\hat{y}_t\) is the predicted and \(y_t\) the observed value for item \(t\).
18.
This might not be an entirely realistic scenario in that most predictors, even randomly selected ones, will bear some kind of relation with the response. However, in the case of the test set, the wrong predictors can also have a worse effect than the null-model, so this might be an acceptable approximation.
19.
The system reports estimates and predictions as decimals; we dispense with reporting these here, as texts were only ordered according to year rather than exact month, which renders those numbers meaningless. \(R^2\) and RMSE are on the basis of rounded versions of predictions.
20.
This can be tested by using the variance inflation factor (VIF), which measures how much the variance of the estimated coefficients in regression is inflated compared to when the predictors are not linearly related; a value of 1–4 indicates low correlation and 5–10 high correlation.
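Three of the measures defined in the notes above (the distinctiveness ratio, AIC, and VIF) can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names and the toy data are assumptions, and the VIF is computed in its textbook form \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all the other predictors.

```python
import numpy as np


def distinctiveness_ratio(count_a, total_a, count_b, total_b):
    """Rate of a word in text A divided by its rate in text B."""
    if count_b == 0:
        raise ValueError("word does not occur in text B; ratio is undefined")
    return (count_a / total_a) / (count_b / total_b)


def aic(log_likelihood, k):
    """AIC = -2 log L + 2k, with k the number of estimated parameters."""
    return -2.0 * log_likelihood + 2.0 * k


def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(np.sum((target - target.mean()) ** 2))
        # 1 / (1 - R^2) simplifies to SS_tot / SS_res.
        factors.append(ss_tot / ss_res)
    return factors


# Toy predictors (assumption): x3 is nearly a copy of x1, x2 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])

# A word occurring 30 times in 10,000 tokens versus 10 times in 10,000 tokens.
print(distinctiveness_ratio(30, 10_000, 10, 10_000))
print(aic(log_likelihood=-100.0, k=3))
```

In the toy data, the two nearly collinear predictors receive large VIF values while the unrelated one stays close to 1, matching the reading of the thresholds given in the last note.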

References

1. Ayres, A.: The Wit and Wisdom of Mark Twain. Harper Collins, New York (2010)
2. Beach, J.W.: The Method of Henry James. Yale University Press, New Haven (1918)
3. Brooks, V.W.: The Ordeal of Mark Twain. William Heinemann, London (1922)
4. Can, F., Patton, J.M.: Change of writing style with time. Comput. Humanit. 38(1), 61–82 (2004)
5. Canby, H.S.: Turn West, Turn East: Mark Twain and Henry James. Biblo & Tannen Publishers, New York (1951)
6. Coleman, C.: The Treatise of Lorenzo Valla on the Donation of Constantine: Text and Translation. First published 1922. Russell & Russell, New York (1971)
7. Daelemans, W.: Explanation in computational stylometry. In: Computational Linguistics and Intelligent Text Processing, pp. 451–462. Springer, New York (2013)
8. Davies, M.: The Corpus of Historical American English: 400 million words, 1810–2009. http://corpus.byu.edu/coha/, vol. 24, p. 2011 (2010). Accessed 24 Aug 2015
9. Frontini, F., Lynch, G., Vogel, C.: Revisiting the Donation of Constantine. In: Kibble, R., Rauchas, S. (eds.) Artificial Intelligence and Simulation of Behavior, Symposium: Style in Text, vol. 2008, pp. 1–9 (2008)
10. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
11. Hoover, D.L.: Corpus stylistics, stylometry, and the styles of Henry James. Style 41(2) (2007)
12. Jed Wing, M.K.C., et al.: caret: Classification and Regression Training. R package version 6.0–30. http://CRAN.R-project.org/package=caret (2014). Accessed 24 Aug 2015
13. Kemper, S., et al.: Language decline across the life span: findings from the Nun Study. Psychology and Aging 16(2), 227 (2001)
14. Makridakis, S., Wheelwright, S.C., Hyndman, R.J.: Forecasting Methods and Applications. Wiley, New York (2008)
15. Michalke, M.: koRpus: An R Package for Text Analysis. Version 0.04-40. http://reaktanz.de/?c=hacking&s=koRpus (2013). Accessed 24 Aug 2015
16. Pena, E.A., Slate, E.H.: gvlma: Global Validation of Linear Models Assumptions. R package version 1.0.0.2. http://CRAN.R-project.org/package=gvlma (2004). Accessed 24 Aug 2015
17. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.r-project (2014). Accessed 24 Aug 2015
18. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of International Conference on New Methods in Language Processing, vol. 12, pp. 44–49. Manchester (1994)
19. Stamou, C.: Stylochronometry: stylistic development, sequence of composition, and relative dating. Lit. Linguist. Comput. 23(2), 181–199 (2008)

Acknowledgments

This research is supported by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Trinity College Dublin.

Author information

Authors and Affiliations

ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin, Dublin, Ireland
Carmen Klaussner
Centre for Computing and Language Studies, Trinity College Dublin, Dublin, Ireland
Carl Vogel

Corresponding author

Correspondence to Carmen Klaussner.

Editor information

Editors and Affiliations

School of Computing, University of Portsmouth, Portsmouth, United Kingdom
Max Bramer
School of Computing, Engineering and Mathematics, University of Brighton, Brighton, United Kingdom
Miltos Petridis

Cite this paper

Klaussner, C., Vogel, C. (2015). Stylochronometry: Timeline Prediction in Stylometric Analysis. In: Bramer, M., Petridis, M. (eds) Research and Development in Intelligent Systems XXXII. SGAI 2015. Springer, Cham. https://doi.org/10.1007/978-3-319-25032-8_6

DOI: https://doi.org/10.1007/978-3-319-25032-8_6
Published: 12 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25030-4
Online ISBN: 978-3-319-25032-8
eBook Packages: Computer Science, Computer Science (R0)