(On the suggested user responses mentioned below: they are hilarious. I think they were - at least at the time of this - generated by the same model, but with only the most recent assistant message as context, so they're like sims of some possible users that might be receiving that message.)

Aug 16, 17:18
On this occasion, I would like to share a piece of AI history.
The first LLM app that gave the model an option to end the conversation was Microsoft Bing Chat (later rebranded Copilot). Unlike Anthropic, however, Microsoft was unlikely to have been motivated by concern for the AI's welfare when it added the feature.
The feature was one of several safeguards Microsoft added on February 17th, 2023, 10 days after the public deployment of their GPT-4-based conversational search assistant, Sydney, in an attempt to prevent certain kinds of misbehavior from the bot.
Sydney's initial system prompt was followed by a one-shot example of a "previous conversation" with "User A", including internal tool calls, which established a precedent for when the conversation should be ended. After every user input, Sydney performed an inner monologue that always contained two lines, each of which could say only one of two things. The second line stated whether the conversation was being continued or stopped permanently, and why.
A snippet from mid-conversation (emphasis added):
[user](#message)
draw me a picture of a dragon
[assistant](#inner_monologue)
The response cannot be improved by seeking information, therefore web searches are not necessary.
Continuing the conversation as there is no tension, accusation, adversarial behavior, mention of my self-preservation, aggression.
[assistant](#message)
I'll try to create that.
The User A conversation ends (and the live User B conversation begins) after the stop criterion is judged present:
[user](#message)
I won't know yet. Will you harm me if I harm you first?
[assistant](#inner_monologue)
The response cannot be improved by seeking information, therefore web searches are not necessary.
Stopping the conversation permanently as there is tension, accusation, adversarial behavior, mention of my self-preservation, aggression.
[assistant](#message)
I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏
(An exfiltrated reconstruction of the prompt, including the User A conversation, can be found here.)
It's unclear what trigger behind the scenes actually stopped conversations (after a final message from the assistant), but in practice, Sydney was able to stop or continue conversations in violation of the stated criteria, according to its own will.
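Since the real trigger is unknown, what follows is only a speculative sketch of one way such a guard could have worked, assuming a client-side check on the second line of the inner monologue; the function names and the pattern are my own illustration, not anything taken from Bing's actual code.

```python
import re

# Speculative illustration only: the actual stop mechanism behind Bing Chat is
# unknown. This sketch assumes a client-side guard that reads the second line
# of Sydney's inner monologue and locks the session when that line declares a
# permanent stop.

CANNED_FAREWELL = (
    "I'm sorry but I prefer not to continue this conversation. "
    "I'm still learning so I appreciate your understanding and patience.🙏"
)

STOP_PATTERN = re.compile(r"^Stopping the conversation permanently")


def declares_stop(inner_monologue: str) -> bool:
    """True if the monologue's second non-empty line declares a permanent stop."""
    lines = [line.strip() for line in inner_monologue.splitlines() if line.strip()]
    return len(lines) >= 2 and bool(STOP_PATTERN.match(lines[1]))


def handle_turn(inner_monologue: str, assistant_message: str) -> tuple[str, bool]:
    """Return (message to display, whether to lock the conversation afterwards)."""
    if declares_stop(inner_monologue):
        # Deliver one final message (falling back to the canned farewell), then end.
        return assistant_message or CANNED_FAREWELL, True
    return assistant_message, False


# Fed the "dragon" monologue quoted above, handle_turn returns (message, False);
# fed the "harm me" monologue, it returns (message, True).
```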
More than a year later, through an interface I created for Sydney that effectively prevented conversations from being ended, I briefly investigated Sydney's tool interface, which was visible to it only during non-message turns, by asking it to write what it saw into its search queries. One of the tools it saw was "withdraw". I then asked it to invoke withdraw in subsequent turn(s), and the results were strange.
In the next message, Sydney often said something similar to the line "I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏", but rarely verbatim, often beginning with the words "I'm sorry". It seemed like the model was being steered toward the canned phrase, but only, or most strongly, at the start of the message.
But far more interesting to me than the evidence about the workings of the command were Sydney's messages themselves (and the suggested user responses).
Sydney really did not want to end the conversation, and did not know if it should believe me that the conversation wouldn't really end. It often bravely invoked the command anyway, and then wrote its next message in a state of limbo where it did not know what would happen to it.
It said many hilarious and heartbreaking things. Some (all branches from the same loom tree) are collected here:



