• Pennomi (English) · 56 up / 1 down · 7 months ago

    There’s already more than enough training data out there. The important thing that remains is to filter it so it doesn’t also include humanity’s stupidest data.

    That and make the algorithms smarter so they are resistant to hallucination and misinformation - that’s not a data problem, it’s an architecture problem.

    • FaceDeer · 19 up / 0 down · 7 months ago

      Stupid data can be useful for training as a negative example. Image generators use negative prompts to good effect.

    • MotoAsh (English) · 9 up / 0 down · 7 months ago

      Butbutbut my ignorant racism is the truth!! That’s why I hear it from everyone, including [insert nearby relatives here]!!

      • Takumidesh (English) · 3 up / 0 down · 7 months ago

        Well, is the goal truth? Or a simulacrum of a human?

        • MotoAsh (English) · 2 up / 0 down · 7 months ago (edited 7 months ago)

          Considering not even all humans are hireable, I’d say only a fool aims for a simulacrum.

    • CanadaPlus (English) · 4 up / 0 down · 7 months ago (edited 7 months ago)

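      Well, it’s established wisdom that the dataset size needs to scale with the number of model parameters. Roughly linearly for compute-optimal training, IIRC (on the order of 20 tokens per parameter); it’s the training compute that grows about quadratically. If you don’t have that much data, the training basically won’t work: it will overfit or just not progress.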
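
      A rough sketch of that scaling arithmetic, assuming the commonly cited compute-optimal rule of thumb of about 20 training tokens per parameter and the approximation C ≈ 6·N·D for training FLOPs. Neither figure comes from this thread, and the function names are made up for illustration. Under those assumptions the token count grows linearly with parameter count, while compute grows quadratically.

```python
# Back-of-the-envelope sizing for compute-optimal training.
# Assumptions (not from the thread): ~20 training tokens per parameter,
# and training compute C ≈ 6 * N * D FLOPs (N parameters, D tokens).

TOKENS_PER_PARAM = 20        # assumed rule-of-thumb ratio
FLOPS_PER_PARAM_TOKEN = 6    # assumed forward + backward cost factor


def optimal_tokens(n_params: float) -> float:
    """Training tokens suggested for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float) -> float:
    """Approximate training compute when the dataset scales with model size."""
    return FLOPS_PER_PARAM_TOKEN * n_params * optimal_tokens(n_params)


if __name__ == "__main__":
    for n in (1e9, 10e9, 70e9):
        print(f"{n:.0e} params -> {optimal_tokens(n):.1e} tokens, "
              f"{training_flops(n):.1e} FLOPs")
```

      Doubling the parameter count doubles the suggested token budget and quadruples the estimated compute, which is probably where the "quadratic" intuition comes from.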

    • Ultraviolet (English) · 4 up / 0 down · 7 months ago

      You also have to filter out the AI-generated garbage that is rapidly becoming the majority of content on the internet.