Research:Wiki-Pie: A Policy Invocation and Enactment English Wikipedia Dataset
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
Policies that can be edited and enforced by anyone in the community are crucial to how Wikipedia functions. This project develops a dataset tracking when editors invoke a policy or likely enact it, focusing on English Wikipedia's core content policies: viewpoint neutrality (WP:NPOV), verifiability (WP:V), and no original research (WP:NOR).
Background
Policy is core to how people collaborate in Wikipedia and other complex peer production organizations.[1] Formalized, documented, and enforced policies let peer-producers self-impose standards for their work, enabling collaboration across differences in experience, viewpoint, and approach.[2]
Many quantitative and qualitative studies have analyzed Wikipedia's policies as texts [3][4] without directly speaking to policy use. Others have tracked hyperlinks to policy or policy templates to support large-scale quantitative analysis of policy use.[5] Although such work is valuable, it is limited by a reliance on explicit mentions, which are only a small part of how Wikipedians use policy. Contributors may reference policy indirectly by using related keywords. Explicit invocations (e.g., "WP:NPOV issues") may serve different communicative purposes than implicit ones (e.g., "attribute disputed claim to source"). As rules are used more, they may become increasingly taken for granted, making explicit invocation less important.
Moreover, just knowing when editors invoke a policy tells us little about how the policy shapes their collaborative practice. Many policies, not just on Wikipedia, have room for interpretive ambiguity that helps them become widely used. Thus the same policy (e.g., no harassment) can be used in potentially conflicting ways, even within one community.
Data on policy use are crucial to building a deeper understanding of how policies shape content and behavior on Wikipedia. We aim to build a dataset covering the entire edit history of English Wikipedia as a step toward quantitative analyses that might identify types of policy misuse, evaluate the efficacy of enforcement mechanisms, measure adherence to the spirit versus the letter of policy, and trace policy use over the project's development.
Research questions
We define and focus on two types of policy use:
- Invocation: When editors invoke a policy by referring to it, both directly and indirectly, possibly casually
- Enactment: When editors likely use a policy in practice by improving an article's compliance with it
Our research questions include:
- How can we reliably identify when editors invoke or enact Wikipedia's core content policies?
- What proportion of edits invoke vs. enact each policy?
- Do editors who invoke a policy typically also enact it?
Methods
Viewing the measurement of policy invocation and enactment as a classification problem, we are using an automated content analysis pipeline to construct our large-scale dataset of article revisions.[6] To achieve construct validity, we follow an iterative content analysis methodology with our research team.[7]
For each policy, we begin by studying a recent version of the policy text. We then move through an iterative process:
- Take a probability sample of 200 non-bot Wikipedia article revisions stratified according to the predictions of an LLM prompted with the policy text and initial classification instructions
- Research assistants independently code the sample
- Calculate inter-rater agreement using Gwet's AC (preferred over Krippendorff's alpha because of class imbalance)
- Meet weekly to discuss disagreements, sharpen our understanding of the policies, and develop a shared protocol for systematic content analysis
- Revise the LLM prompt accordingly, adding challenging few-shot examples with explanations, improving instructions per our protocol, and selectively adding context from the article's edit history
- Repeat until the team reaches a high level of agreement
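As a concrete illustration of the agreement step, Gwet's AC1 for two raters can be computed directly from the paired labels. This is a minimal sketch for the two-rater case, not the project's actual analysis code:

```python
def gwets_ac1(rater_a, rater_b):
    """Gwet's AC1 agreement coefficient for two raters.

    Unlike Krippendorff's alpha, AC1 stays stable when one category
    dominates, which matters here because most edits neither invoke
    nor enact a given policy.
    """
    n = len(rater_a)
    cats = sorted(set(rater_a) | set(rater_b))
    q = len(cats)
    # Observed agreement: share of items both raters label the same.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the average marginal proportion per category.
    p_e = sum(
        pi * (1 - pi)
        for pi in ((rater_a.count(c) + rater_b.count(c)) / (2 * n) for c in cats)
    ) / (q - 1)
    return (p_o - p_e) / (1 - p_e)
```

With more than two coders, a library implementation (e.g., the formulas in Gwet's handbook) generalizes the chance-agreement term, but the two-rater form above conveys the idea.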
Protocol for invocation
Our protocol for invocation looks for explicit signals in the revision, i.e., in its edit summary and in the content added or removed, such as templates placed on the article page. This includes not only wikilinks to the core policy page but also links to related policies, guidelines, and essays, as well as policy-related keywords. Interpreting an edit summary often requires examining the revision's changes and other context.
Protocol for enactment
Our protocol for enactment focuses on whether an edit unambiguously makes a change that improves how the rendered article accords with the policy. We aim to avoid speculating about editors' intentions while remaining open to the broad range of ways a policy can be enacted and assuming good faith on the editor's part.
With WP:V, we focus on actual changes to citations that either add new sources or improve metadata to make sources more verifiable.
With WP:NPOV, we focus on changes reflecting NPOV considerations such as slanted language, due emphasis, and whether sourced claims should be attributed or stated as facts.
Preliminary results
Inter-rater reliability
We have completed preliminary protocols for WP:V invocation and enactment and a draft protocol for WP:NPOV invocation.
| Policy | Task | Coding Rounds | Gwet's AC |
|---|---|---|---|
| WP:V | Invocation | 5 | 0.94 |
| WP:V | Enactment | 5 | 0.92 |
| WP:NPOV | Invocation | 2 | 0.95 |
| WP:NPOV | Enactment | 2 | 0.88 |
LLM benchmarks
We have preliminary evidence that medium-sized open-weight LLMs classify invocation with high reliability and enactment with moderate reliability. The best model we have tested is Qwen-2.5-72B:
| Task | Weighted Macro-avg F1 | Precision | Recall |
|---|---|---|---|
| WP:V Invocation | 0.90 | 0.89 | 0.92 |
| WP:V Enactment | 0.77 | 0.72 | 0.88 |
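The weighted macro-average F1 reported above averages per-class F1 scores with weights proportional to each class's support in the gold labels, which is a common choice under the class imbalance typical of policy-use classification. A minimal re-implementation, with labels invented purely for illustration:

```python
def weighted_f1(gold, pred):
    """Per-class F1 averaged with weights proportional to each
    class's support in the gold labels ("weighted macro-average")."""
    n = len(gold)
    score = 0.0
    for c in sorted(set(gold)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * sum(g == c for g in gold) / n  # weight by class support
    return score
```

This matches scikit-learn's `f1_score(..., average="weighted")`; the same support-weighted averaging applies to the precision and recall columns.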
Preliminary findings
Based on an LLM-classified random sample of 1000 edits:
- About 18% of edits enact WP:V, but only 30% of these also invoke the policy, suggesting that work to improve verifiability often goes unmentioned
- Only 6% of edits invoke WP:V, but 92% of edits that do so also enact it, suggesting that invoking a policy usually reflects a genuine effort to improve compliance
Next steps
We are currently developing preliminary protocols for each of the three core content policies independently. However, our analysis of WP:NPOV use improved our understanding of WP:V use. We will therefore revisit each policy's protocol in relation to the others, sharpening the distinctions between the policies.
We also plan to:
- Solicit feedback from community members on our preliminary protocols to ensure they align with emic perspectives on the policies
- Code a large sample of edits as a benchmark for classifiers and for statistical correction of misclassification bias[8][9]
- Release the Wiki-Pie dataset of classified edits along with the benchmark sample and code
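One standard approach to correcting misclassification bias in estimated proportions (assumed here as an illustration, not confirmed as this project's chosen method) is the Rogan-Gladen estimator, which recovers the true prevalence from the classifier's observed positive rate using sensitivity and specificity estimated on a gold-coded benchmark sample:

```python
def corrected_prevalence(p_obs, sensitivity, specificity):
    """Rogan-Gladen estimator of true prevalence.

    p_obs: share of edits the classifier labels positive.
    sensitivity, specificity: classifier error rates estimated
    on a hand-coded (gold) benchmark sample.
    """
    return (p_obs + specificity - 1) / (sensitivity + specificity - 1)

# If the true share of enacting edits were 0.20 and the classifier had
# sensitivity 0.88 and specificity 0.95, it would flag
# 0.20 * 0.88 + 0.80 * 0.05 = 0.216 of edits; the estimator
# recovers the underlying 0.20 from that observed rate.
```

The regression-oriented corrections cited above[8][9] generalize this idea to downstream statistical models rather than simple proportions.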
Resources
- Code repository: https://gitea.communitydata.science/groceryheist/Wikipi
- Dataset: [TBD]
- Project etherpad: https://etherpad.communitydata.science/p/wikipedia-policy-team
References
- Kriplean, Travis; Beschastnikh, Ivan; McDonald, David W.; Golder, Scott A. (2007). "Community, consensus, coercion, control: cs*w or how policy mediates mass participation". Proceedings of the 2007 International ACM Conference on Supporting Group Work. ACM. pp. 167-176. doi:10.1145/1316624.1316648.
- Gibbs, Jennifer L.; Rice, Ronald E.; Kirkwood, Gavin L. (2021). "Digital discipline: Theorizing concertive control in online communities". Communication Theory. doi:10.1093/ct/qtab017.
- Matei, Sorin Adam; Dobrescu, Caius (2010). "Wikipedia's "Neutral Point of View": Settling conflict through ambiguity". The Information Society 27 (1): 40-51. doi:10.1080/01972243.2011.534368.
- Heaberlin, Bradi; DeDeo, Simon (2016). "The evolution of Wikipedia's norm network". Future Internet 8 (2): 14. doi:10.3390/fi8020014.
- Beschastnikh, Ivan; Kriplean, Travis; McDonald, David W. (2008). "Wikipedian self-governance in action: Motivating the policy lens". Proceedings of the ICWSM 2. AAAI. pp. 27-35.
- Grimmer, Justin; Roberts, Margaret E.; Stewart, Brandon M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press. ISBN 978-0-691-20799-5.
- Krippendorff, Klaus (2018). Content Analysis: An Introduction to Its Methodology (4th ed.). SAGE Publications. ISBN 978-1-5063-9567-8.
- TeBlunthuis, Nathan; Hase, Valerie; Chan, Chung-Hong (2024). "Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can!". Communication Methods and Measures 18 (3): 278-299. doi:10.1080/19312458.2023.2293713.
- Egami, Naoki; Hinck, Musashi; Stewart, Brandon M.; Wei, Hanying (2024). "Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses".