• gaylord_fartmaster · 23 days ago

    They’re already ignoring robots.txt, so I’m not sure why anyone would think they won’t just ignore this too. All they have to do is get a new IP and change their user agent.
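
    To illustrate how little the user agent proves, here’s a minimal sketch (the URL and header string are just placeholders): nothing verifies the header, so user-agent blocking only stops bots that choose to be honest.

    ```python
    # Minimal sketch: a client can claim to be any browser it likes,
    # so filtering on User-Agent only catches honest bots.
    # Placeholder URL and header string, not a real target.
    import requests

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0",
    }
    resp = requests.get("https://example.com/", headers=headers, timeout=10)
    print(resp.status_code)
    ```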

    • redditReallySucks · 23 days ago

      Cloudflare is protecting a lot of sites from scraping with their proof-of-work (PoW) captchas. They could just let through the ones who pay.
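
      For anyone who hasn’t run into proof-of-work gating before, here’s a toy sketch of the idea (this is not Cloudflare’s actual scheme, just the general shape): solving costs the client CPU on every request, while verifying costs the server a single hash.

      ```python
      # Toy proof-of-work sketch: the client must find a counter whose
      # hash has a number of leading zero bits before being served.
      import hashlib
      import itertools

      DIFFICULTY_BITS = 18  # deliberately low so the demo finishes fast

      def solve(challenge: str) -> int:
          target = 1 << (256 - DIFFICULTY_BITS)
          for counter in itertools.count():
              digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
              if int.from_bytes(digest, "big") < target:
                  return counter

      def verify(challenge: str, counter: int) -> bool:
          digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
          return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))

      nonce = "server-issued-random-nonce"  # placeholder challenge
      answer = solve(nonce)                 # slow for the client
      print(answer, verify(nonce, answer))  # one hash for the server
      ```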

  • scarabine · 23 days ago

    I have an idea. Why don’t I put a bunch of my website stuff in one place, say a pdf, and you screw heads just buy that? We’ll call it a “book”

  • magic_smoke · 23 days ago

    As someone who uses Invidious daily, I’ve always been of the belief that if you don’t want something scraped, then maybe don’t upload it to a public web page/server.

    • General_Effort · 22 days ago

      There probably aren’t many people here who understand the connection between Invidious and scraping.

    • Justas🇱🇹 · 20 days ago

      Imagine a company that sells a lot of products online. Now imagine a scraping bot coming in at peak sales hours and hitting every product list and product page separately. Now realise that some genuine users will have a worse buying experience because of that load.

      • magic_smoke · 20 days ago

        Yeah, there are far easier ways to combat that without trying to prevent scraping altogether.

        Maybe don’t ship 20 units to the same address.
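
        Plain per-client rate limiting is one example of such an easier mitigation; below is a toy token-bucket sketch (the limits and client key are made up for illustration, not taken from anywhere in this thread).

        ```python
        # Toy token bucket: each client gets `capacity` requests,
        # refilled at `rate` per second. Numbers are made up.
        import time

        class TokenBucket:
            def __init__(self, capacity: float = 10, rate: float = 1.0):
                self.capacity = capacity
                self.rate = rate
                self.tokens = capacity
                self.last = time.monotonic()

            def allow(self) -> bool:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                return False

        buckets: dict[str, TokenBucket] = {}

        def allow_request(client_ip: str) -> bool:
            return buckets.setdefault(client_ip, TokenBucket()).allow()

        # First 10 rapid requests pass, the tail turns False.
        print([allow_request("203.0.113.7") for _ in range(15)])
        ```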

    • Rikudou_Sage · 23 days ago

      Put a page on your website saying that scraping it costs [insert amount] and block the bots otherwise.
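
      A rough sketch of that idea, assuming user-agent strings can be taken at face value (which, as others point out below, they can’t): answer known AI crawlers with 402 Payment Required and point them at a licensing page. The bot names and paths here are illustrative.

      ```python
      # Sketch of "pay or go away": known AI-crawler user agents get a
      # 402 pointing at a (hypothetical) licensing page; everyone else
      # gets the normal content. A dishonest bot can simply lie.
      from http.server import BaseHTTPRequestHandler, HTTPServer

      PAID_CRAWLERS: set[str] = set()  # hypothetical set of paying bots
      AI_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot")

      class Handler(BaseHTTPRequestHandler):
          def do_GET(self):
              ua = self.headers.get("User-Agent", "")
              if any(name in ua for name in AI_CRAWLERS) and ua not in PAID_CRAWLERS:
                  self.send_response(402)  # Payment Required
                  self.send_header("Content-Type", "text/plain")
                  self.end_headers()
                  self.wfile.write(b"Scraping costs [insert amount]: see /scraping-license\n")
                  return
              self.send_response(200)
              self.send_header("Content-Type", "text/plain")
              self.end_headers()
              self.wfile.write(b"Regular page content\n")

      if __name__ == "__main__":
          HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
      ```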

        • melroy · 23 days ago

          Also, you don’t want to block legit search engines that aren’t scraping your data for AI.
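
          The conventional way to draw that line is robots.txt: opt out of the AI-training crawlers while leaving normal search indexing alone. A sketch that writes such a file is below; the bot tokens are the published ones, but honouring them is voluntary, which is exactly the complaint at the top of this thread.

          ```python
          # Write a robots.txt that asks AI-training crawlers to stay out
          # while leaving ordinary search indexing untouched.
          from pathlib import Path

          ROBOTS_TXT = """\
          User-agent: GPTBot
          Disallow: /

          User-agent: CCBot
          Disallow: /

          User-agent: Google-Extended
          Disallow: /

          User-agent: *
          Allow: /
          """

          Path("robots.txt").write_text(ROBOTS_TXT)
          ```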

          • gravitas_deficiency · 23 days ago

            Again: it’s hard to differentiate all those different bots, because you have to trust that they are what they say they are, and they often are not.
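
            For the crawlers that do publish a verification method, the usual check is a reverse-then-forward DNS lookup rather than trusting the user agent; here’s a sketch for Googlebot (the sample IP comes from Google’s published crawler ranges and may change).

            ```python
            # Double-lookup check: reverse-resolve the IP, check the domain,
            # then forward-resolve the name and confirm it maps back.
            import socket

            def is_verified_googlebot(ip: str) -> bool:
                try:
                    host, _, _ = socket.gethostbyaddr(ip)          # reverse DNS
                    if not host.endswith((".googlebot.com", ".google.com")):
                        return False
                    forward_ips = socket.gethostbyname_ex(host)[2]  # forward DNS
                    return ip in forward_ips
                except (socket.herror, socket.gaierror):
                    return False

            print(is_verified_googlebot("66.249.66.1"))  # sample address only
            ```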

              • vinnymac · 23 days ago (edited)

                It certainly can be a cat-and-mouse game, but scraping at scale tends to stay ahead of the security teams. Some examples:

                https://brightdata.com/

                https://oxylabs.io/

                Preventing access by requiring an account with strict access rules can curb the vast majority of scraping; then your only bad actors are the rich venture capitalists.
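
                As a sketch of what “account plus strict access rules” can look like (the keys, quota and storage here are all made up), per-key quotas make large-scale scraping expensive enough that only well-funded actors bother:

                ```python
                # Toy per-account quota check; in practice this would sit
                # in a gateway or middleware, not in application code.
                from collections import defaultdict

                DAILY_QUOTA = 1000                     # requests per key per day
                usage: dict[str, int] = defaultdict(int)
                valid_keys = {"key-alice", "key-bob"}  # issued to real accounts

                def allow(api_key: str) -> bool:
                    if api_key not in valid_keys:
                        return False                   # no account, no access
                    usage[api_key] += 1
                    return usage[api_key] <= DAILY_QUOTA

                print(allow("key-alice"), allow("no-such-key"))
                ```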