Stylochronometry: Timeline Prediction in Stylometric Analysis

  • Conference paper
  • In: Research and Development in Intelligent Systems XXXII (SGAI 2015)

Abstract

We examine stylochronometry, the question of measuring change in linguistic style over time within an authorial canon and in relation to change in language in general use over a contemporaneous period. We take the works of two prolific authors from the 19th/20th century, Henry James and Mark Twain, and identify variables that change for them over time. We present a method of analysis that applies regression on linguistic variables to predict a temporal variable. In order to identify individual authors’ effects on the model, we compare the model based on the novelists’ works to a model based on a 19th/20th century American English reference set. We evaluate using \(R^2\) and root mean square error (RMSE), which indicates the average error in predicting the year. On the two-author data, we achieve an RMSE of \(\pm \)7.2 years on unseen data (baseline: \(\pm \)13.2); for the larger reference set, our model obtains an RMSE of \(\pm \)4 years on unseen data (baseline: \(\pm \)17).
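
The evaluation described above can be illustrated with a minimal sketch. It computes RMSE and \(R^2\) for a model's year predictions on held-out texts and compares them with a baseline. All years below are invented toy values, not data from the paper, and the baseline (always predicting the mean training year) is an assumption made for illustration; the paper's own baseline may be defined differently.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted years."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot.

    On held-out data this can be negative, meaning the mean of the
    observed values fits better than the predictions.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical publication years (toy values, not from the paper).
train_years = [1876, 1881, 1886, 1897, 1902]
test_years = [1880, 1890, 1900]

# Hypothetical regression output for the three test texts.
model_pred = [1883, 1888, 1896]

# Assumed baseline: predict the mean training year for every test text.
baseline_year = sum(train_years) / len(train_years)
baseline_pred = [baseline_year] * len(test_years)

print("model    RMSE:", rmse(test_years, model_pred))
print("baseline RMSE:", rmse(test_years, baseline_pred))
print("model    R^2 :", r_squared(test_years, model_pred))
print("baseline R^2 :", r_squared(test_years, baseline_pred))
```

In this toy setting the baseline's \(R^2\) on the test years is slightly negative, which is the situation in which the mean of the test data would have been the better predictor.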

Notes

  1. See Coleman [6] and Frontini et al. [9] for discussion of attempts in the 15th century to date a text purported to be from the 3rd but shown to be most likely from (circa) the 8th. The former relies on manual methods, the latter on semi-automatic methods.

  2. Early-Wittgenstein may be stylistically as well as conceptually distinct from Late-Wittgenstein.

  3. ANalysis Of VAriance (ANOVA) is a collection of methods developed by R.A. Fisher to analyze differences within and between groups. Principal Component Analysis (PCA) is an unsupervised statistical technique that converts a set of possibly correlated variables into a new set of uncorrelated variables, the principal components.

  4. The coefficient of determination \(R^2\) indicates how well a model fits the observed data. It ranges from 0 to 1, with 0 indicating a poor fit and 1 a perfect one. When predictions are evaluated against held-out outcomes (test set), values can also be negative, in which case the mean of the data provides a better fit than the model.

  5. Distinctiveness Ratio: Measure of variability defined by the rate of occurrence of a word in a text divided by its rate of occurrence in another.

  6. Here, we only include the main works/novels for reasons of text length and genre homogeneity.

  7. http://www.gutenberg.org/ - last verified August 2015.

  8. https://archive.org/ - last verified August 2015.

  9. The separate entries are created using the POS tags assigned by the tagger to the individual word entity in its context.

  10. This is without loss of generality to the bag-of-words analysis of texts in which sentence structures are not used subsequent to POS tagging.

  11. Lexical features are continuous here because we use relative frequencies.

  12. This applies if it is meaningful to count instances of the variable, as it is for token n-grams: such relativization does not apply, for example, to average word lengths.

  13. This is not to argue that complementary categories (e.g. relativized counts of features that are not shared between both authors over the entire duration, or features that are never shared by the authors over the duration) are uninteresting. However, in this work we address change in language shared by the two authors, relative to change in the background language of their time, on the view that this provides an interesting perspective on their distinctiveness from each other and from everyone else.

  14. The t-value measures the size of the difference between an observed sample statistic and its hypothesized population parameter relative to the variation in the sample data. The further the t-value falls into either tail of the t-distribution, the greater the evidence against the null hypothesis that there is no significant difference between the hypothesized and observed values.

  15. In this case, the Akaike information criterion (AIC) is used to evaluate the model: \(AIC = -2 \log L + 2k\), where \(L\) is the likelihood and \(k\) the number of estimated parameters in the model. Thus, AIC rewards goodness-of-fit, but penalizes the number of parameters in the model.

  16. All models reported on here had reliable \(\bar{R}^2\) values, with an associated p-value \({<}0.0001\), so we dispense with reporting on this in each individual case.

  17. \(RMSE = \sqrt{\frac{\sum ^{n}_{t=1} (\hat{y}_t - y_t)^2}{n}}\), where \(\hat{y}_t\) is the predicted and \(y_t\) the observed value for text \(t\).

  18. This might not be an entirely realistic scenario in that most predictors, even randomly selected ones, will bear some kind of relation with the response. However, in the case of the test set, the wrong predictors can also have a worse effect than the null-model, so this might be an acceptable approximation.

  19. The system reports estimates and predictions as decimals; we dispense with reporting these here, as texts were only ordered according to year rather than exact month, which renders those numbers meaningless. \(R^2\) and RMSE are on the basis of rounded versions of predictions.

  20. This can be tested using the variance inflation factor (VIF), which measures how much the variance of the estimated regression coefficients is inflated compared to when the predictors are not linearly related; a value of 1–4 indicates low correlation and 5–10 high correlation.
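
Three of the quantities defined in the notes (the distinctiveness ratio of note 5, the AIC of note 15, and the VIF of note 20) can be written out as a short sketch. The function names and all input values are hypothetical and chosen only for illustration. The VIF function covers only the special case of two predictors, where the \(R^2\) of regressing one predictor on the other equals their squared Pearson correlation; it is not the general multi-predictor computation.

```python
def distinctiveness_ratio(count_a, tokens_a, count_b, tokens_b):
    """Rate of a word in text A divided by its rate in text B (note 5)."""
    return (count_a / tokens_a) / (count_b / tokens_b)

def aic(log_likelihood, k):
    """AIC = -2 log L + 2k (note 15); k is the number of estimated parameters."""
    return -2.0 * log_likelihood + 2.0 * k

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - r^2) for a model with exactly two predictors (note 20).

    With two predictors, regressing one on the other gives R^2 = r^2,
    the squared Pearson correlation, so both predictors share one VIF.
    """
    n = len(x1)
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    var1 = sum((a - mean1) ** 2 for a in x1)
    var2 = sum((b - mean2) ** 2 for b in x2)
    r_squared = (cov * cov) / (var1 * var2)
    return 1.0 / (1.0 - r_squared)

# Hypothetical counts: a word occurs 30 times in 1000 tokens of text A
# and 10 times in 1000 tokens of text B.
ratio = distinctiveness_ratio(30, 1000, 10, 1000)

# Hypothetical fitted model: log-likelihood -120 with 3 parameters.
model_aic = aic(-120.0, 3)

# Hypothetical relative frequencies of two features across five texts.
feature_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_2 = [2.0, 1.0, 4.0, 3.0, 5.0]
vif = vif_two_predictors(feature_1, feature_2)

print(ratio, model_aic, vif)
```

Under the rule of thumb in note 20, the VIF of these two toy features would fall in the low-correlation range.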

References

  1. Ayres, A.: The Wit and Wisdom of Mark Twain. Harper Collins, New York (2010)

  2. Beach, J.W.: The Method of Henry James. Yale University Press, New Haven (1918)

  3. Brooks, V.W.: The Ordeal of Mark Twain. William Heinemann, London (1922)

  4. Can, F., Patton, J.M.: Change of writing style with time. Comput. Humanit. 38(1), 61–82 (2004)

  5. Canby, H.S.: Turn West, Turn East: Mark Twain and Henry James. Biblo & Tannen Publishers, New York (1951)

  6. Coleman, C.: The Treatise of Lorenzo Valla on the Donation of Constantine: Text and Translation. First published 1922. Russell & Russell, New York (1971)

  7. Daelemans, W.: Explanation in computational stylometry. In: Computational Linguistics and Intelligent Text Processing, pp. 451–462. Springer, New York (2013)

  8. Davies, M.: The Corpus of Historical American English: 400 million words, 1810–2009. http://corpus.byu.edu/coha/, vol. 24, p. 2011 (2010). Accessed 24 Aug 2015

  9. Frontini, F., Lynch, G., Vogel, C.: Revisiting the Donation of Constantine. In: Kibble, R., Rauchas, S. (eds.) Artificial Intelligence and Simulation of Behavior - Symposium: Style in Text, vol. 2008, pp. 1–9 (2008)

  10. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)

  11. Hoover, D.L.: Corpus stylistics, stylometry, and the styles of Henry James. Style 41(2) (2007)

  12. Jed Wing, M.K.C., et al.: caret: Classification and Regression Training. R package version 6.0–30, http://CRAN.R-project.org/package=caret (2014). Accessed 24 Aug 2015

  13. Kemper, S., et al.: Language decline across the life span: findings from the Nun Study. Psychol. Aging 16(2), 227 (2001)

  14. Makridakis, S., Wheelwright, S.C., Hyndman, R.J.: Forecasting Methods and Applications. Wiley, New York (2008)

  15. Michalke, M.: koRpus: An R Package for Text Analysis. Version 0.04-40. http://reaktanz.de/?c=hacking&s=koRpus (2013). Accessed 24 Aug 2015

  16. Pena, E.A., Slate, E.H.: gvlma: Global Validation of Linear Models Assumptions. R package version 1.0.0.2. http://CRAN.R-project.org/package=gvlma (2004). Accessed 24 Aug 2015

  17. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.r-project (2014). Accessed 24 Aug 2015

  18. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of International Conference on New Methods in Language Processing, vol. 12, pp. 44–49. Manchester (1994)

  19. Stamou, C.: Stylochronometry: stylistic development, sequence of composition, and relative dating. Lit. Linguist. Comput. 23(2), 181–199 (2008)

Acknowledgments

This research is supported by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Trinity College Dublin.

Author information

Corresponding author

Correspondence to Carmen Klaussner.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Klaussner, C., Vogel, C. (2015). Stylochronometry: Timeline Prediction in Stylometric Analysis. In: Bramer, M., Petridis, M. (eds) Research and Development in Intelligent Systems XXXII. SGAI 2015. Springer, Cham. https://doi.org/10.1007/978-3-319-25032-8_6

  • DOI: https://doi.org/10.1007/978-3-319-25032-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25030-4

  • Online ISBN: 978-3-319-25032-8

  • eBook Packages: Computer Science (R0)
