• magnetosphere
    36 up, 6 down · 7 months ago

    and even AI-generated “synthetic data” as options.

    HAHAHA

    • Haggunenons
      English · 14 up, 5 down · 7 months ago

      This is how the best chess and Go computers got to be as good as they are: AI-generated “synthetic data.”

      • MajorasMaskForever
        English · 18 up, 0 down · 7 months ago

        Yes and no.

        Chess bots (like Stockfish) are trained on game samples, with the goal of predicting which search paths to keep looking at and which moves will result in a win. You get game samples by playing the game, so it made sense to have Stockfish play itself, since the input was always still generated by the rules of chess.

        If a classifier or predictive model creates its own data without tying it to the rules and methods of reality, it’s going to become increasingly divorced from reality. If I had to guess, that’s what the guy in the article is referencing when talking about “sanitizing” the data. Some problems, like chess, are really easy. Mimicking human speech? Probably not.
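
        To make the self-play point concrete, here is a rough sketch using tic-tac-toe as a stand-in for chess. It is not how Stockfish or any real engine is trained; the function names (`self_play_game`, `verify_game`, `generate_dataset`) and the random move policy are made up for illustration. The point is only that every synthetic game is generated by the rules and can be re-checked against them, which is the tie to reality that free-form text lacks.

```python
import random

# All eight winning lines on a 3x3 board with squares indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has completed a line, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board):
    """The rules of the game: only empty squares may be played."""
    return [i for i, cell in enumerate(board) if cell == " "]


def self_play_game(rng):
    """Play one game against itself with a random policy (illustrative only).

    Returns (moves, result), where result is 'X', 'O', or 'draw'.
    """
    board, moves, player = [" "] * 9, [], "X"
    while winner(board) is None and legal_moves(board):
        move = rng.choice(legal_moves(board))  # always a legal move
        board[move] = player
        moves.append(move)
        player = "O" if player == "X" else "X"
    return moves, winner(board) or "draw"


def verify_game(moves, result):
    """Replay a recorded game from scratch and check it against the rules."""
    board, player = [" "] * 9, "X"
    for move in moves:
        if move not in range(9) or board[move] != " ":
            return False  # off the board, or square already taken
        if winner(board) is not None:
            return False  # someone kept playing after the game ended
        board[move] = player
        player = "O" if player == "X" else "X"
    finished = winner(board) is not None or not legal_moves(board)
    return finished and (winner(board) or "draw") == result


def generate_dataset(n_games, seed=0):
    """Generate n_games synthetic, rule-conforming game samples."""
    rng = random.Random(seed)
    return [self_play_game(rng) for _ in range(n_games)]


if __name__ == "__main__":
    data = generate_dataset(1000)
    print(all(verify_game(m, r) for m, r in data))
```

        There is no equivalent of `verify_game` for a paragraph of English, which is why a model generating its own text has nothing forcing it back toward reality.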

      • CanadaPlus
        English · 6 up, 1 down · 7 months ago · edited 7 months ago

        Yeah, because the human developers know the rules of chess, so it’s easy to generate or verify perfect quality games at massive scale. Natural language can’t be tackled like that; certainly not yet, probably not ever. Many have tried and failed to parse natural language algorithmically, but at the end of the day it seems to rely heavily on loose conventions and endless shared experiences. So, you need content from the wild, or you’re basically letting the AI mark its own homework.

    • EnderMB
      English · 6 up, 0 down · 7 months ago · edited 7 months ago

      I work in AI. This is very common, and lots of companies use this. It’s also very common in academia, as it’s an easy way to get data. Synthetic data can range from totally fake to techniques like machine translation to transform data from one language to another.

      When they say “AI generated”, it’s probably just using one of the APIs the LLM orchestrates.

    • CanadaPlus
      English · 5 up, 0 down · 7 months ago

      The human centipede, but circular.

    • FaceDeer
      1 up, 0 down · 7 months ago

      Your laughter is misplaced. Synthetic data is a serious solution, and when it’s done right it can give better results than raw “real” data alone.