Rose here. Also @umbraroze for non-kbin stuff.

  • 0 Posts
  • 31 Comments
Joined 1 year ago
Cake day: June 14th, 2023

  • Reddit has a user data checkout feature (IIRC, check the user settings or maybe the Reddit help pages to find it).

    It’s a bit crap though.

    It takes a long time to process, especially if you happened to post in the era when the Reddit data infrastructure was horribly terrible instead of merely ordinarily terrible, and apparently this involves some handwork in the worst cases on behalf of the staff.

    Some data may be missing or truncated. It doesn’t give you data from privated/banned subreddits (which was a fun thing to discover because last time I tried to do this the blackouts were on), and even for legit stuff, long comments/posts may be truncated. Even so, I’m pretty sure that the dumps just straight up didn’t have all of my posts from several years ago, even if those were on public subreddits. So you need to make sure the checked out data is sensible.

    In conjunction with the official dumps, I recommend a few other tools, especially since the dumps aren’t really magnificently usable on their own. One tool that I found personally invaluable is reddit-user-to-sqlite, which allows you to import Reddit data dumps and available live user data (I think it does this by scraping or something, I’m sure it worked despite the API being shut down) into an SQLite database, and Datasette is a nice frontend for browsing the posts.

    As for scrubbing, there are tools for that which are supposed to work. I think.
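
One way to sanity-check a dump for missing years is to count comments per year and look for gaps. This is only a minimal sketch: the column names (`id`, `date`, `subreddit`, `body`), the date format, and the sample rows are assumptions made up for illustration, not the confirmed layout of Reddit's export.

```python
import csv
import io
from collections import Counter


def comments_per_year(csv_text, date_column="date"):
    """Count rows per year, taking the year from the first four characters of the date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        counts[row[date_column][:4]] += 1
    return dict(counts)


def empty_years(counts, first_year, last_year):
    """Return the years in the inclusive range that have no comments at all."""
    return [str(y) for y in range(first_year, last_year + 1)
            if counts.get(str(y), 0) == 0]


# Hypothetical sample data; a real check would read the exported CSV file instead.
sample = """id,date,subreddit,body
a1,2015-03-02 10:00:00 UTC,worldbuilding,First comment
a2,2015-07-19 18:30:00 UTC,linux,Second comment
a3,2018-01-05 09:15:00 UTC,finland,Third comment
"""

counts = comments_per_year(sample)
gaps = empty_years(counts, 2015, 2018)
print(counts, gaps)
```

A year with zero comments is not proof that data is missing, but if you remember posting during that time, it is a good hint to look closer.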


  • Yup. The robots.txt file is not only meant to block robots from accessing the site, it’s also meant to block bots from accessing resources that are not interesting for human readers, even indirectly.

    For example, MediaWiki installations are pretty clever in that by default, /w/ is blocked and /wiki/ is encouraged. Because nobody wants technical pages and wiki histories in search results, they only want the current versions of the pages.

    Fun tidbit: in the late 1990s, there was a real epidemic of spammers scraping the web pages for email addresses. Some people developed wpoison.cgi, a script whose sole purpose was to generate garbage web pages with bogus email addresses. Real search engines ignored these, thanks to robots.txt. Guess what the spam bots did?

    Do the AI bros really want to go there? Are they asking for model collapse?
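
The /w/ versus /wiki/ split described above can be checked with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the robots.txt content, the `wiki.example.org` host, and the `ExampleBot` user agent are hypothetical, written in the spirit of that setup rather than copied from any real MediaWiki installation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: technical pages live under /w/, articles under /wiki/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Ask the parsed rules whether the given agent may fetch the path."""
    return parser.can_fetch(agent, "https://wiki.example.org" + path)


article_ok = allowed("/wiki/Main_Page")
history_ok = allowed("/w/index.php?title=Main_Page&action=history")
print(article_ok, history_ok)
```

A well-behaved crawler runs a check like this before every request. Nothing enforces it, which is exactly why a bot that ignores the file can be fed garbage.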



  • I’m using Finnish keyboard layout (same as Swedish basically).

    I like how AltGr+7/8/9/0 gives me { [ ] }, it’s a very nice grouping. The key next to Z is < > and you get | with AltGr, which is very handy.

    Only thing that’s mildly annoying from a programming viewpoint is that for tilde and backtick, the keys do diacritics - you need to press the diacritic key and space. Backtick is especially fun, because it’s shift+acute, space. Meanwhile, the key next to 1 does § ½, which aren’t that handy most of the time. I often just stick backtick on that key if I’m particularly assed to customise keyboard layouts. Similarly, shift+4 is ¤, which is another not particularly useful character (but I don’t mind that, because £ $ € all need to be produced with AltGr, which is at least consistent).


  • I’m, like, OK, nuclear power isn’t necessarily a bad thing.
    But power plants like that should probably serve wider municipal needs.

    Building a private nuclear power plant just to power a data center? Well that’s clearly stupid.
    Building a private nuclear power plant just to power a data center focused on a niche application? Well you know how that goes.

    Also, look up SL-1. Disturbingly few Americans I’ve talked to have heard about that. Generally a good argument about why not every single thing should be powered by a tiny dedicated nuclear reactor.


  • In the middle of a couple of worldbuilding projects. Haven’t really had many good ideas for the fantasy project lately.

    Ah HA! Maybe I’ll do some mild subversion of expectations.
    Maybe one of the most famous sites in this world, where people come to visit from far and wide, has a tiny old withered tree.
    …I mean, there could be a lot of legitimate logical reasons why this site could be important. Maybe the tree has a really fascinating story behind it.
    Heck, there’s probably many such places in our world too! Can think of at least one off the top of my head.
    I should write this down.

    Last year I felt really crappy as far as my writing projects go, but in the last few months, if there’s one thing I’ve learned it’s that even the smallest ideas can sometimes break the writer’s block. Keep writing them down!



  • There are a lot of people who go “I tried to learn X through Duolingo and failed”. Sure, that’s probably true, because staring at the app is not how language learning works. Much like 100 years ago, people would have said you can’t learn a language by reading a single book.

    Duolingo is great for basics of the language, vocabulary and constant daily lessons. But you always need more. There’s a whole language sphere out there. People actually using the language and whatnot.

    I started studying French through Duolingo and about 6 months later I was like “I really need a grammar book and a dictionary, dammit”. A year in, I was like “I should try reading news in French and maybe try a book.”


  • My theoretical answer is this: in an ideal world, there would be no copyright at all. This is an artificial contrivance that was once dreamed up to serve a physical-copy economy, and it was rendered obsolete by the digital age. Shit would be so much easier if we got rid of this shit and everyone could share everything by default without any profit motive. (Caveat: This will not work unless literally every jurisdiction on the planet gets rid of copyright laws all at once, otherwise this is way too exploitable due to power imbalance. So I don’t think this is a practical proposition. *cough* unless we all decide Anarchism is a good idea after all *cough*)

    My practical answer is this: Welllllll we’re kinda damned if we do and we’re damned if we don’t. My personal feeling is that AI creations aren’t really copyrightable, and even suggesting they are copyrightable is kind of opening a huge can of worms regarding what exactly counts as “creativity” in the first place. The best we can do under the current copyright regime is to regulate how the AI datasets are curated, because goodness knows the current datasets weren’t exactly ethically obtained.






  • I was a Slashdot user.

    People kept hyping Digg as a Slashdot replacement, but trying to submit posts was actually even more futile in practice than trying to submit articles to Slashdot editors. So much bigger hivemind too. Boring unfunny comment section.

    When I first joined Reddit, it seemed like it was mostly populated by Slashdot refugees. Just people posting awesome shit. Great riveting discussions, even before anyone actually read the articles. That sort of stuff.


  • Depends on the type of account, but here are some of the common methods of how this might happen:

    • The attacker could be straight up guessing the password. (One possible way to mitigate this: the website can go “wow, 10 failed login attempts from that source. I’m going to ignore all attempts from there for 24 hours.”)
    • The attacker could be using previously exposed passwords. (One possible way to mitigate this: The websites should immediately require a password reset for all users when that kind of data breach happens. For users: never use the same password for multiple different services, certainly never reuse a compromised password even if it’s for a different service. Also: haveibeenpwned.com)
    • The attacker, currently using the same network, could hijack the session. (This was a really huge problem back in the day. In this day and age, websites should be using HTTPS, which limits this very much. Still possible if the site doesn’t use HTTPS, and through some other vectors, e.g. malware or hijacked network hardware).

    Also: Malware is a really scary big problem in that it’s rarely targeting you specifically. Why do that, when they can hit a million people at the same time and sift through that stolen data for the most valuable stuff, right?
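
The first mitigation in the list above (ignore a source for 24 hours after 10 failed logins) could look roughly like this. It is a minimal in-memory sketch, not how any particular website does it: the `LoginThrottle` class, its method names, and the idea of keying on a "source" string such as an IP address are all assumptions for illustration.

```python
import time

MAX_FAILURES = 10                # failed attempts before the source is ignored
LOCKOUT_SECONDS = 24 * 60 * 60   # ignore the source for 24 hours


class LoginThrottle:
    """Tracks failed logins per source and blocks a source after too many."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable so the lockout can be tested
        self._failures = {}          # source -> failed attempts so far
        self._locked_until = {}      # source -> timestamp when the block ends

    def is_blocked(self, source):
        until = self._locked_until.get(source)
        if until is None:
            return False
        if self._clock() >= until:
            # Lockout expired: forget the source and start counting afresh.
            del self._locked_until[source]
            self._failures.pop(source, None)
            return False
        return True

    def record_failure(self, source):
        if self.is_blocked(source):
            return
        count = self._failures.get(source, 0) + 1
        self._failures[source] = count
        if count >= MAX_FAILURES:
            self._locked_until[source] = self._clock() + LOCKOUT_SECONDS

    def record_success(self, source):
        # A successful login resets the failure counter for that source.
        self._failures.pop(source, None)


# Usage with a fake clock, so no real waiting is needed.
now = [0.0]
throttle = LoginThrottle(clock=lambda: now[0])
for _ in range(MAX_FAILURES):
    throttle.record_failure("203.0.113.7")
blocked_after_ten = throttle.is_blocked("203.0.113.7")
now[0] += LOCKOUT_SECONDS
blocked_after_a_day = throttle.is_blocked("203.0.113.7")
print(blocked_after_ten, blocked_after_a_day)
```

A real site would keep this state in shared storage rather than in process memory, and would have to decide what counts as "that source", since attackers can spread guesses over many addresses.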


  • Well, it seemed to be a way to support the site and get to see new features ahead of time, so yeah, why not? I only decided not to renew my gold access when it became very clear Spez wouldn’t ban the hate subs he loved.

    As for getting gold otherwise:

    I’m an introvert, ok? I mostly only comment if I have something worthwhile to say.

    So the only comments I ever got gilded by others were drunken shitposts. And in one instance some random off the cuff post. …I don’t get it.

    Anyway. Basically, I didn’t want to post any Gold Baits™, because that way lies madness.


  • Been using a Suunto 5 Peak watch since May and it’s been absolutely great. Dunno if 250€ counts as inexpensive, but like we say in Finland, poor people can’t afford to buy cheap shit that breaks right away. (I think they have cheaper options?) Suunto watches talk to a phone app which at least on Android is pretty great, and the app can talk to other services which can analyse stuff further.


  • I was a Reddit user for ages. Reddit search always sucked. Heck, Reddit could barely make their own data available to the users (which is why their user histories are so limited and why the GDPR takeouts take a week). Everyone, and I mean EVERYONE, used external search engines.

    Do they want to block external searches? Literally enshittify their shit further? Are they willing to hold back progress?

    Just today I was thinking of Reddit Gold - back when I actually paid for it, the marketing spin was “you get to test new features before we add them to everyone else!” Literally none of the Gold features I’ve ever used made it to the unwashed masses. I take it back, saving comments did.

    So yeah, they will hold back progress. In fact, progress isn’t on the cards. It’s just regress. AND you can be a premium user and PAY for it.