Commons:Bots/Requests/FlaschBot1

From Wikimedia Commons, the free media repository

Operator: Fl.schmitt

Bot's tasks for which permission is being sought: Add {{Information}} to Media missing infobox template. See the extensive preparatory discussion at Commons:Bots/Work_requests#Media_missing_infobox_template. The bot tries to put as much information as possible into SDC fields (author, source, captions, date), since {{Information}} uses those data as defaults.

Automatic or manually assisted: Manually assisted. The bot follows "divide and conquer" tactics. Since it seems to be impossible to apply one solution to > 300,000 media files lacking an infobox template, it will work on sets of files, usually defined by the same author / creator (assuming that those files share sufficient similarities). The bot will be run multiple times on that set of files in different modes. First, it analyzes the file page content and tries to categorize each of its components, without modifying any content on Commons. This step will be repeated (manually) as often as needed to adapt the categorization patterns, until a pattern set that fits all file pages of the current set has been found. Then, a "dry-run" ("simulation") generates an overview of the "planned" modifications (see txt and SQLite analysis and simulation results for Category:Media missing infobox template (maps t1)). Only if this simulation result seems acceptable will the bot run in "doit" mode to apply the "proposed" edits.
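The three run modes described above could be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the names `Mode`, `Report`, and `run`, and the injected `categorize`, `build`, and `save` callables, are hypothetical, and the rule that a page with uncategorized content gets no proposed edit is an assumption drawn from the requirement that the pattern set must fit every page in the set.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Mode(Enum):
    ANALYZE = "analyze"    # categorize only, never touch Commons
    SIMULATE = "simulate"  # additionally compute the proposed edit
    DOIT = "doit"          # compute the proposed edit and save it


@dataclass
class Report:
    title: str
    fields: dict = field(default_factory=dict)     # categorized components
    unmatched: list = field(default_factory=list)  # content no pattern covered
    proposed_text: Optional[str] = None
    saved: bool = False


def run(title, text, mode, categorize, build, save):
    """Process one file page in the given mode (all callables are hypothetical)."""
    fields, unmatched = categorize(text)
    report = Report(title, fields, unmatched)
    # Assumption: nothing is proposed while any content is still uncategorized.
    if mode is Mode.ANALYZE or unmatched:
        return report
    report.proposed_text = build(fields)
    if mode is Mode.DOIT:
        save(title, report.proposed_text)
        report.saved = True
    return report
```

Injecting `save` keeps the only side effect in one place, so "analyze" and "simulate" runs can be repeated freely while the patterns are being adapted.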

Edit type (e.g. Continuous, daily, one time run): Multiple times a week, but not daily.

Maximum edit rate (e.g. edits per minute): Maybe 5-6 per minute?

Bot flag requested: (Y/N): Y

Programming language(s): Python (pywikibot)

Fl.schmitt (talk) 21:56, 6 September 2024 (UTC)

Discussion

It sounds to me like mostly manual work, which should not be done with a bot account. Am I mistaken? --Krd 13:27, 26 September 2024 (UTC)

I don't know whether this sort of task should be done by a bot account. I assumed that making such edits automatically on numerous files using a script is a typical task for a bot.
Anyway, in my opinion it's mostly bot work, for example automatically checking the initial upload date of a file and using it as the latest value for SDC inception if there's no creation date available in the unstructured file description. That's work for a Python script and not for a human being... Additionally, it's not feasible to edit both SDC and text content of a file page manually, setting multiple SDC values for a single property (e.g. creator). In short: the focus lies on the script-based application of rules that were created "intellectually".
Of course, there's a "manual" part: checking and adapting the regex rules to identify/categorize the unstructured content, to detect patterns in how e.g. creators are mentioned. Additionally, some files may require manual intervention if a rule would apply only to single files. But once the regex rules are defined, they can and should be applied to the complete input set. And that's a bot's task, isn't it? Fl.schmitt (talk) 15:46, 26 September 2024 (UTC)
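The two mechanisms mentioned in this comment, regex rules that categorize unstructured description text and the fallback to the first upload date as an upper bound for the inception date, might look something like the sketch below. The rule names, the patterns in `RULES`, and the functions `categorize` and `inception` are illustrative assumptions; the bot's real patterns are adapted per batch and are not shown in this discussion.

```python
import re
from datetime import date

# Hypothetical rule set; real rules would be tuned to each set of files.
RULES = {
    "author": re.compile(r"^(?:author|created by)\s*:?\s*(?P<value>.+)$", re.IGNORECASE),
    "date": re.compile(r"^(?:date|created on)\s*:?\s*(?P<value>\d{4}-\d{2}-\d{2})$", re.IGNORECASE),
}


def categorize(text):
    """Return (fields, leftovers): values recognised by a rule, and unmatched lines."""
    fields, leftovers = {}, []
    for line in (raw.strip() for raw in text.splitlines()):
        if not line:
            continue
        for name, pattern in RULES.items():
            match = pattern.match(line)
            if match:
                fields[name] = match.group("value")
                break
        else:
            leftovers.append(line)  # signals that a rule still needs adapting
    return fields, leftovers


def inception(fields, first_upload):
    """Return (date, is_upper_bound).

    If the description carries a valid creation date, use it as is. Otherwise
    fall back to the first upload date, which can only be the latest possible
    creation date, so it is flagged as an upper bound.
    """
    raw = fields.get("date")
    if raw is not None:
        try:
            return date.fromisoformat(raw), False
        except ValueError:
            pass  # looked like a date but is not a valid one
    return first_upload, True
```

For example, `categorize("Author: Jane Doe\nscanned from an atlas")` would recognise the author and report the second line as a leftover, and `inception({}, date(2006, 1, 2))` would return the upload date flagged as an upper bound.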
You stated above that it is working manually assisted, and I'd agree that fully automatically it's not even possible. E.g. in this edit (Special:Diff/922463275) the "source" is still a mess. Even if the bot is flagged, this edit IMO should not be done with a bot flag. (And perhaps it should not be done automatically at all if there is a risk of source and author information getting mixed up, because that may affect file attribution.) Different opinions welcome. Krd 19:04, 1 October 2024 (UTC)
Hmm - that "mess" could be handled better if there were clear guidelines documented on {{Information}} on how to reference WP imports as source. Using {{Source}} seemed the best option to me. If it isn't, I would be glad for advice on how to do the "source" statement the correct way in such cases. But maybe it isn't worth the time; I've already wasted a lot of it trying to find a solution that looked viable. Fl.schmitt (talk) 19:27, 1 October 2024 (UTC)
Yes, this is much needed, please get this bot to work to add the standardized, well-established, expected Information template, which can be queried or displayed e.g. in apps. However, it probably needs many tests and examples before it can add much information to these templates, or add categories for manual maintenance where needed. --Prototyperspective (talk) 10:51, 3 October 2024 (UTC)
I thought the bot would add the Information template to files that lack them. Sorry, I misunderstood. Prototyperspective (talk) 17:43, 11 October 2024 (UTC)
@Prototyperspective: no, that's exactly the bot's task - adding {{Information}} to files that lack it, but in combination with SDC. Since {{Information}} uses some SDC values as defaults, there's IMHO no need and no use to save those values redundantly, both in SDC and as template parameters. SDC can be queried, too. So, where's the misunderstanding? Fl.schmitt (talk) 19:18, 11 October 2024 (UTC)
Okay, thanks for explaining; then it's basically what I thought it was. Such a bot is much needed. Also see Commons:Village pump/Technical#How to search fields of files' Information template? (I think using the insource search operator combined with regex could be the solution to it). Prototyperspective (talk) 21:34, 11 October 2024 (UTC)
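The point made in this exchange, that values already stored in SDC need not be repeated as {{Information}} parameters, can be illustrated with a small sketch. The function `information_wikitext`, the parameter order, and the choice to emit an empty parameter rather than omit it are all assumptions for illustration; the discussion only states that the template falls back to SDC for author, source, captions, and date, and does not show the exact wikitext the bot writes.

```python
# Parameters written into the template, in a fixed (assumed) order.
PARAMETERS = ("description", "date", "source", "author", "permission")


def information_wikitext(fields, in_sdc):
    """Build {{Information}} wikitext, leaving SDC-backed parameters empty.

    fields: values extracted from the old page text, keyed by parameter name.
    in_sdc: names of parameters whose value has been stored in SDC instead.
    """
    lines = ["{{Information"]
    for name in PARAMETERS:
        # Leave the parameter empty when SDC already carries the value, so the
        # template's documented SDC fallback supplies it (per the discussion).
        value = "" if name in in_sdc else fields.get(name, "")
        lines.append("|" + name + "=" + value)
    lines.append("}}")
    return "\n".join(lines)
```

Called with `{"description": "Map", "author": "Jane"}` and `in_sdc={"author"}`, this yields a template whose `author` parameter is empty while `description` keeps its value, which is why the simulated page text can look "incomplete" on its own.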

Is the whitespace around the "int:filedesc" intended? This appears uncommon and hard to read to me. --Krd 09:55, 11 October 2024 (UTC)

Good point - not sure why I've put it there. This is fixed now. Fl.schmitt (talk) 14:34, 11 October 2024 (UTC)
Well, the outer blanks are common; I'd suggest removing only the inner ones. Krd 14:38, 11 October 2024 (UTC)
Done :-) Fl.schmitt (talk) 14:49, 11 October 2024 (UTC)
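As a rough illustration of the formatting agreed on here, the sketch below normalizes a section header so that the blanks between the equals signs and the braces are kept while blanks inside the braces are removed. It assumes that "inner" refers to the spaces inside `{{ ... }}` and "outer" to the spaces next to the `==` markers; the function name `normalize_header` and the regex are hypothetical.

```python
import re

# Matches a heading line whose content is a single {{int:...}} message.
_HEADER = re.compile(r"^(=+)\s*\{\{\s*(int:[^{}|]+?)\s*\}\}\s*(=+)\s*$")


def normalize_header(line):
    """Return the header with outer blanks kept and inner blanks removed."""
    match = _HEADER.match(line)
    if not match or match.group(1) != match.group(3):
        return line  # not a balanced {{int:...}} header; leave untouched
    return match.group(1) + " {{" + match.group(2) + "}} " + match.group(3)
```

With this, `=={{ int:filedesc }}==` becomes `== {{int:filedesc}} ==`, and a header already in that form is returned unchanged.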

I think the bot could be flagged but should edit without actually setting the bot flag on its edits during an extended slow test run. Do you agree? --Krd 09:55, 11 October 2024 (UTC)

No problem, that's perfectly fine for me. Let's give it a try. Fl.schmitt (talk) 14:40, 11 October 2024 (UTC)
Ok, please make a slow start. Krd 14:57, 11 October 2024 (UTC)
Bot has done six additional edits (combination of SDC and page text update): Revision #938775515 / Revision #938776502 for Map of Canton Thurgau.png, Revision #938777401 / Revision #938778686 for Map of Canton Uri.png and Revision #938779544 / Revision #938780857 for File:Map of Canton Zug.png. It's exactly the same pattern as the previous test runs, except for the modified headers. Preview for all pages that are currently in Category:Media missing infobox template (maps t1): simulate_result_61442.txt. That txt file shows the "proposed" SDC values to set (as JSON) and page text content to write (as wikitext). Please keep in mind that the page text (esp. the parameters of {{Information}}) is "incomplete" because it relies on the SDC values that are used as defaults. Fl.schmitt (talk) 17:32, 11 October 2024 (UTC)
Please continue. Krd 05:01, 12 October 2024 (UTC)
Ok - now the bot is running continuously, but with a minimum delay of five minutes between file pages. So, the bot will visit 12 file pages per hour. Fl.schmitt (talk) 07:05, 12 October 2024 (UTC)
Update: had to fix an issue with captions - slowed the bot down to 4 file page visits per hour to keep better control. Fl.schmitt (talk) 18:14, 12 October 2024 (UTC)
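The pacing described in these two updates follows from simple arithmetic: a minimum delay of 300 seconds allows at most 12 page visits per hour, and 4 visits per hour corresponds to a 900-second delay. A minimal sketch of such a minimum-delay throttle is shown below; the `Throttle` class and `min_delay_seconds` helper are hypothetical and do not represent how the bot or Pywikibot actually implements rate limiting.

```python
import time


def min_delay_seconds(pages_per_hour):
    """Delay between visits needed to stay at or below the given hourly rate."""
    return 3600 / pages_per_hour


class Throttle:
    """Enforce a minimum delay between consecutive file page visits."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_delay seconds have passed since the last visit."""
        if self._last is not None:
            remaining = self.min_delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

A loop over file pages would call `throttle.wait()` before each visit, for instance with `Throttle(min_delay_seconds(4))` for the slower setting.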
Please do not remove the empty line above Categories, as in this edit [1]. This empty line is standard in almost all files; keeping it reflects common practice and improves readability. --Schlurcher (talk) 08:50, 1 November 2024 (UTC)
Is the bot still running? Krd 14:44, 7 November 2024 (UTC)
No, currently not. The first "batch" of files is finished, so I'll have to adapt the regex patterns to the next batch, which will take some time. I'll report back when the bot is ready to run again. Fl.schmitt (talk) 20:01, 7 November 2024 (UTC)
What is the estimated time frame? Krd 04:00, 20 November 2024 (UTC)
Maybe mid-December? The problem is that I can't simply start running the bot again, but have to adapt it to a new set of files. Anyway, I don't think there's any hurry for the bot's work, since most of the "target" files have lacked {{Information}} for almost 20 years. Fl.schmitt (talk) 13:00, 20 November 2024 (UTC)

Stale. Please request to re-open when ready. --Krd 16:05, 23 December 2024 (UTC)