• Pennomi · +56/−1 · 7 months ago

    There’s already more than enough training data out there. The important thing that remains is to filter it so it doesn’t also include humanity’s stupidest data.

    That and make the algorithms smarter so they are resistant to hallucination and misinformation - that’s not a data problem, it’s an architecture problem.

    • FaceDeer · +19/−0 · 7 months ago

      Stupid data can be useful for training as a negative example. Image generators use negative prompts to good effect.
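
The negative-prompt mechanism mentioned above is usually implemented as classifier-free guidance: the model's prediction for the negative prompt serves as the baseline to steer away from. A toy NumPy sketch (the arrays are stand-ins for a real diffusion model's noise predictions, not actual model output):

```python
import numpy as np

def guided_noise(cond_pred, neg_pred, guidance_scale=7.5):
    """Classifier-free guidance step: push the denoising direction
    away from the negative-prompt prediction and toward the
    positive-prompt prediction."""
    return neg_pred + guidance_scale * (cond_pred - neg_pred)

# Toy 1-D "noise predictions" standing in for a diffusion model's output.
cond = np.array([1.0, 0.5])  # prediction conditioned on the prompt
neg = np.array([0.2, 0.4])   # prediction conditioned on the negative prompt

out = guided_noise(cond, neg, guidance_scale=2.0)
print(out)
```

Raising the guidance scale pushes the result further from whatever the negative prompt describes, which is exactly how "bad example" data gets put to work.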

    • MotoAsh · +9/−0 · 7 months ago

      Butbutbut my ignorant racism is the truth!! That’s why I hear it from everyone, including [insert nearby relatives here]!!

      • Takumidesh · +3/−0 · 7 months ago

        Well is the goal truth? Or a simulacrum of a human?

        • MotoAsh · +2/−0 · 7 months ago (edited)

          Considering not even all humans are hireable, I’d say only a fool aims for a simulacrum.

    • CanadaPlus · +4/−0 · 7 months ago (edited)

      Well, it’s established wisdom that the dataset size needs to scale with the number of model parameters. Quadratically, IIRC. If you don’t have that much data the training basically won’t work; it will overfit or just not progress.
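
For what it's worth, the widely cited Chinchilla scaling result puts the relationship closer to linear than quadratic: roughly 20 training tokens per model parameter. A back-of-the-envelope sketch under that rule-of-thumb assumption (the 20:1 ratio is an approximation, not a law):

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal training-token count under the 'Chinchilla'
    rule of thumb of ~20 tokens per parameter (an approximation only)."""
    return n_params * tokens_per_param

for n_params in (7_000_000_000, 70_000_000_000):
    tokens = chinchilla_tokens(n_params)
    print(f"{n_params // 10**9}B params -> ~{tokens / 10**9:.0f}B tokens")
```

Even at that linear rate, frontier-scale models need trillions of tokens, which is why the "running out of data" worry exists at all.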

    • Ultraviolet · +4/−0 · 7 months ago

      You also have to filter out the AI generated garbage that is rapidly becoming a majority of content on the internet.

  • magnetosphere · +36/−6 · 7 months ago

    and even AI-generated “synthetic data” as options.

    HAHAHA

    • Haggunenons · +14/−5 · 7 months ago

      This is how the best chess and Go computers got to be as good as they are: AI-generated “synthetic data.”

      • MajorasMaskForever · +18/−0 · 7 months ago

        Yes and no.

        Chess bots (like Stockfish) are trained on game samples, with the goal of predicting which search paths to keep looking at and which moves will result in a win. You get game samples by playing the game, so it made sense to have Stockfish play itself, since the input was always still generated by the rules of chess.

        If a classifier or predictive model creates its own data without tying it to the rules and methods of reality, it’s going to become increasingly divorced from reality. If I had to guess, that’s what the guy in the article is referencing when talking about “sanitizing” the data. Some problems, like chess, are really easy. Mimicking human speech? Probably not.
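
The self-play point can be illustrated with a toy game. A minimal sketch using Nim in place of chess (a hypothetical example, not how Stockfish is actually built): because every move comes from the game's rules, every synthetic sample is guaranteed to be legal and correctly labelled.

```python
import random

def self_play_game(stones=10, rng=None):
    """One self-play game of Nim (take 1-3 stones; whoever takes the
    last stone wins). Both 'players' are the same random policy.
    Returns (samples, winner): every visited position and the player
    to move, each labelled with the eventual winner -- rule-consistent
    training data generated for free."""
    rng = rng or random.Random()
    positions, player, winner = [], 0, None
    while stones > 0:
        positions.append((stones, player))
        stones -= rng.randint(1, min(3, stones))  # a legal move, always
        if stones == 0:
            winner = player
        player = 1 - player
    samples = [(pos, mover, winner) for pos, mover in positions]
    return samples, winner

samples, winner = self_play_game(10, random.Random(0))
print(f"winner: player {winner}, {len(samples)} labelled positions")
```

Natural language has no such rulebook to validate against, which is the "yes and no" in a nutshell.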

      • CanadaPlus · +6/−1 · 7 months ago (edited)

        Yeah, because the human developers know the rules of chess, so it’s easy to generate or verify perfect quality games at massive scale. Natural language can’t be tackled like that; certainly not yet, probably not ever. Many have tried and failed to parse natural language algorithmically, but at the end of the day it seems to rely heavily on loose conventions and endless shared experiences. So, you need content from the wild, or you’re basically letting the AI mark its own homework.

    • EnderMB · +6/−0 · 7 months ago (edited)

      I work in AI. This is very common, and lots of companies use it. It’s also very common in academia, as it’s an easy way to get data. Synthetic data can range from totally fake to techniques like machine translation, which transforms data from one language to another.

      When they say “AI generated”, it’s probably just using one of the APIs the LLM orchestrates.
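
One common flavour of that machine-translation trick is back-translation: round-trip text through a pivot language to get paraphrases as extra training pairs. A sketch with a hypothetical `translate` placeholder (a real system would call an actual MT model or API here; the lookup table exists only to make the example self-contained):

```python
def translate(text, src, tgt):
    """Hypothetical stand-in for a machine-translation call.
    A real pipeline would invoke an MT model or API here."""
    lookup = {
        ("hello world", "en", "fr"): "bonjour le monde",
        ("bonjour le monde", "fr", "en"): "hello, world",
    }
    return lookup.get((text.lower(), src, tgt), text)

def back_translate(text, pivot="fr"):
    """Round-trip a sentence through a pivot language to produce a
    paraphrase: a cheap source of synthetic training pairs."""
    return translate(translate(text, "en", pivot), pivot, "en")

print(back_translate("Hello world"))  # a near-paraphrase of the input
```

The round-tripped sentence differs slightly from the original, and that (original, paraphrase) pair is the synthetic data.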

    • CanadaPlus · +5/−0 · 7 months ago

      The human centipede, but circular.

    • FaceDeer · +1/−0 · 7 months ago

      Your laughter is misplaced. Synthetic data is a serious solution, and when it’s done right it can give better results than raw “real” data alone.

  • mPony · +27/−2 · 7 months ago

    In other news, the world’s wealthiest people are running out of money after burning through the entire planet. Sources say one of the world’s multi-billionaires purchased a law firm that was in bed with the RIAA roughly 10-15 years ago, when music piracy was supposedly costing more money than the GDP of all the peoples of the world, combined. “The Owners” (as they have recently rebranded) have decided to collect on this unpaid debt from every living soul, and from all the multinational companies long established as having no living souls whatsoever. A nameless, faceless, pitiless representative was quoted as saying: “Resistance is futile. Your life, as it has been, is over. From this time forward, you will service us.”

  • pelespirit · +30/−5 · 7 months ago

    Is it wrong that I hope it eats itself and implodes?

    • magnetosphere · +8/−1 · 7 months ago

      If it’s wrong, then I’m wrong right along with you.

    • FaceDeer · +14/−10 · 7 months ago

      You’re rooting for a revolutionary new technology to fail rather than get better. I’d call that wrong.

      If nothing else, AI is never going to get worse than it is now. So if that’s intolerably bad for you then improvement is the only way out.

      • Ogmios · +16/−3 · 7 months ago

        AI is never going to get worse than it is now

        Is that just a wild assumption, or? One phenomenon that has already been witnessed with AI is that it does in fact get worse if it trains upon its own output.
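
That degradation (often called model collapse) is easy to demonstrate in miniature: repeatedly fit a distribution to samples drawn from the previous generation's own fit, and the learned distribution tends to narrow and drift. A toy NumPy sketch, with a Gaussian standing in for the model:

```python
import numpy as np

def collapse_demo(generations=50, n_samples=20, seed=0):
    """Toy 'model collapse': each generation fits a Gaussian (by MLE)
    to samples drawn from the previous generation's own fit. The
    finite-sample variance estimate is biased low and noisy, so the
    learned distribution tends to narrow as generations accumulate."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # the "true" data distribution we start from
    for _ in range(generations):
        data = rng.normal(mu, sigma, n_samples)
        mu, sigma = data.mean(), data.std()  # refit on own output only
    return sigma

print(f"std after self-training: {collapse_demo():.3f} (started at 1.0)")
```

Each refit sees only the previous model's output, never fresh real data, so estimation error compounds instead of averaging out.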

        • FaceDeer · +6/−3 · 7 months ago

          Given that I have locally-run AIs sitting on my home computer that I have no plan to delete (until something better comes along), then yeah, it’s never going to get worse. If all else fails I can just use the existing AI for as long as I want. It doesn’t “wear out.”

          • Ogmios · +2/−3 · 7 months ago

            It doesn’t “wear out.”

            The physical components will, and compatible components for older systems keep getting harder to come across. Computers are not immortal entities. Maintenance of older machines will continually become more labour and cost intensive over time.

            • knightly the Sneptaur · +6/−1 · 7 months ago

              The models are digital, making copies for safekeeping is easy.

              The hardware is a computer, and computers are general-purpose. The kind that run AI models well at infrastructure scale are rather high end, but are still available off-the-shelf.

            • FaceDeer · +5/−1 · 7 months ago

              Computers are general-purpose machines. You can run a computer program on any computer; it may just be faster or slower depending on the computer’s capabilities.

              The AIs I run locally are also open-source, so if future computers lose compatibility with existing programs they can be recompiled for the new architecture.

              I suppose we could lose the ability to build computers entirely, but that strikes me as a much bigger and more general issue than just this AI thing.

              • Ogmios · +2/−3 · 7 months ago

                You can run a computer program on any computer

                Incorrect. Certain programs require certain standards for how the hardware is designed. There are already lots of old programs which can’t be run natively on modern machines, and using software to emulate a compatible environment can impact performance in more ways than just speed.

                • FaceDeer · +2/−1 · 7 months ago

                  You’re wildly wrong about the fundamentals of computer science here. I’d be starting from first principles trying to explain further. I recommend reading up on Turing machines, or perhaps getting ChatGPT to explain it to you.

      • pelespirit · +2/−0 · 7 months ago

        You’re rooting for a revolutionary new technology to fail rather than get better

        As long as the oligarchs who run and own these AI systems are at the helm, yes I’m rooting for it to fail. Better is in the eyes of the beholder. Because come on, we all know better is going to be defined as better for the oligarchs, not you or me.

        • FaceDeer · +1/−0 · 7 months ago

          I run my own AI models on my own home PC. Am I an oligarch?

      • Amerikan Pharaoh · +4/−3 · 7 months ago (edited)

        There’s nothing ‘revolutionary’ about a mass theft machine until EVERYONE IT’S STEALING FROM is getting paid out of the thieves’ pockets for what was stolen from them; and the people that run it make no profit from it. Til then, it’s just business as usual out of the west’s necrocapitalists; and your business makes me vomit.

  • givesomefucks · +22/−4 · 7 months ago

    Then again, there is another obvious solution to this manufactured problem: AI companies could simply stop trying to create bigger and better models, given that aside from the training data shortage, they also use tons of electricity and expensive computing chips that require the mining of rare-earth minerals.

    It’s always been a boondoggle.

    But there has to be something investors don’t understand that they’ll dump billions into.

    • Deceptichum · +10/−8 · 7 months ago (edited)

      Might as well stop producing new GPUs entirely. Video games, video editing, shit, basically anything done on a computer is a waste of electricity and rare-earth minerals.

      We don’t even need search engines, let’s go back to libraries and paper books!

      As long as it’s not housing or food, we don’t need it. Let’s go full fucking anprim because anything else isn’t required to survive and is a waste of resources.

      • zurohki · +3/−0 · 7 months ago

        Teaching sand to think was a mistake.

      • Nudding · +2/−1 · 7 months ago

        In your sarcastic drivel, you were correct.

        We should stop wasting electricity for recreation. We should stop mining rare earth metals.

          • Nudding · +3/−1 · 7 months ago

            Much like every other human alive, I’m a hypocrite.

    • Immersive_Matthew · +4/−3 · 7 months ago

      They have already moved on to synthetic data, though, and are doing fine with training bigger models.

      • givesomefucks · +4/−3 · 7 months ago

        I was going to quote the part of the article about that, but it’s most of the article.

        You should just read it.

            • givesomefucks · +3/−1 · 7 months ago (edited)

            You expected me to go and read all your other comments to understand your one reply to me?

            Who has time to do that? Like, not just once, but every time someone replies to you?

            And even if I had, that was the first one in this thread.

            Out of morbid curiosity, what are you even talking about?

              • Immersive_Matthew · +3/−4 · 7 months ago

              Of course not. It’s in comments right here on this thread, and if you are going to take a shot, you need to take a moment to get your facts together.

              What I am talking about is that the AI industry already ran out of data well over a year ago and has been using synthetic data ever since. The authors of the article clearly know this, but wanted to spin it as an issue when it is not, to get the haters to click.

    • drislands · +13/−0 · 7 months ago

      Easy – their methods aren’t sufficient to begin with. No amount of training data would be enough. But perhaps they can develop new methods with what they’ve learned.

    • Pilgrim · +1/−0 · 6 months ago

      Bro please just a little more data and we’ll have AGI, please just make another internet worth of data please bro

  • Immersive_Matthew · +7/−3 · 7 months ago

    While the article makes a big deal about a lack of data and even hints at synthetic data as an option, the truth is that synthetic data is already being used and is apparently just as good for training. Such a misinformation article, designed to stir up the AI haters, especially the headline.

    • voidx (OP) · +5/−0 · 7 months ago

      They seem to be experimenting with that for sure, but need to ensure the quality of the model doesn’t degrade, as per the source article:

      Anthropic’s chief scientist, Jared Kaplan, said some types of synthetic data can be helpful. Anthropic said it used “data we generate internally” to inform its latest versions of its Claude models. OpenAI also is exploring synthetic data generation, the spokeswoman said.

  • kakes · +4/−0 · 7 months ago

    Imo we’ve clearly hit a limit with vertical scaling of data. We need some kind of breakthrough on better ways to process what data we’ve got if we want to continue making meaningful progress.

    • CanadaPlus · +2/−0 · 7 months ago

      So, basically, back to the way the field was for the preceding 60 years.