• Stovetop · 8 months ago

    This is only going to be adding recent Reddit data.

    A growing share of which, I would wager, is already the product of LLMs simulating genuine content while selling something. Training is going to corrupt itself over time unless they figure out how to sanitize the input of content from other LLMs.

    • kromem · 8 months ago (edited)

      It’s not really. There is a potential issue of model collapse when training on only synthetic data, but the same research on model collapse found that a mix of organic and synthetic data performed better than either alone. Additionally, that research used weaker models than what’s typically in use today, for cost reasons, and separate research has shown you can significantly enhance models using synthetic data from SotA models.
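      For a rough sense of what that organic/synthetic mix looks like in practice, here’s a minimal sketch of batch sampling at a fixed ratio; the 30% ratio, function name, and toy data are all hypothetical illustrations, not values from the research:

      ```python
      import random

      def mixed_batches(organic, synthetic, batch_size=32, synthetic_frac=0.3):
          """Yield training batches mixing organic and synthetic examples.

          synthetic_frac is a hypothetical knob: the research only suggests
          that some organic/synthetic mixture beats purely synthetic data,
          not what the ratio should be.
          """
          n_syn = int(batch_size * synthetic_frac)
          n_org = batch_size - n_syn
          while True:
              # Draw from each pool, then shuffle so the model never sees
              # a predictable organic-then-synthetic ordering.
              batch = random.sample(organic, n_org) + random.sample(synthetic, n_syn)
              random.shuffle(batch)
              yield batch

      # Toy usage: 70% organic / 30% synthetic batches of size 8.
      organic_docs = [f"real post {i}" for i in range(40)]
      synthetic_docs = [f"generated post {i}" for i in range(40)]
      batches = mixed_batches(organic_docs, synthetic_docs, batch_size=8)
      print(next(batches))
      ```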

      The actual impact on future models will be minimal, and given the research to date, at least a bit of a mixture is probably even a good thing for future training.