What is wrong with LLM benchmarks, and why are we still using them?

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I’m slowly working my way into the codebase for Textgen, and will hopefully get to the point where I can directly use the command line for prompting.

The llama.cpp Python API is super simple to use and you don’t need to dig into the text-generation-webui codebase at all. Literally just:

import llama_cpp_cuda as llama_cpp    # use llama_cpp_cuda version for support for running GGML models on the GPU

model = llama_cpp.Llama(model_path="", seed=-1, n_ctx=2048, n_gpu_layers=28, low_vram=True)    # use whatever settings here that you would set in text-generation-webui when loading the model, make sure to include n_gqa=8 when using LLaMa v2 70B model

# now you can either do things with the "all-in-one" API...
# you pass your temperature, top_p, top_k, etc. settings here; these are the same as the settings in text-generation-webui. Note that you don't need to pass all the parameters, e.g. you can leave out the mirostat parameters if you aren't using mirostat mode
result = model.create_completion(prompt, max_tokens=200, temperature=0.8, top_p=0.95, top_k=40, repeat_penalty=1.1, frequency_penalty=0.0, presence_penalty=0.0, tfs_z=1.0, mirostat_mode=0, mirostat_tau=5.0, mirostat_eta=0.1)
text = result["choices"][0]["text"]    # create_completion() returns an OpenAI-style dict, the generated text is inside it

# ...or the "manual" way
prompt_tokens = model.tokenize(prompt.encode('utf-8'))
model.reset()
model.eval(prompt_tokens)
generated_tokens = []
while True:
    next_token = model.sample(temp=0.8, top_p=0.95, top_k=40, repeat_penalty=1.1, frequency_penalty=0.0, presence_penalty=0.0, tfs_z=1.0, mirostat_mode=0, mirostat_tau=5.0, mirostat_eta=0.1)
    if next_token != model.token_eos():
        generated_tokens.append(next_token)
        model.eval([next_token])
    else:
        break
text = model.detokenize(generated_tokens).decode('utf-8')    # detokenize() takes a flat list of token ids

See the documentation here for more information: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/ You only really need to pay attention to __init__(), tokenize(), detokenize(), reset(), eval(), sample(), and generate(). create_completion() provides an “all-in-one” wrapper around eval/sample/generate that is intended to be (loosely) compatible as a drop-in replacement for the OpenAI Python library. create_chat_completion() is likewise intended to be a replacement for OpenAI but if you want direct control over the prompt format then ignore it entirely (it’s not even documented exactly how the prompt is formatted when using this function…).

Do you happen to know if the prompt processing differences in Textgen, and others like Kobold, are all arbitrary processing done before llama.cpp is called (or some similar code), or is there some other API level that more complex character prompts are tapping into?

They are not doing anything special with the model (no fancy API or anything). All they are doing is including some extra text before your input that describes the characters, scene etc. and possibly a direct instruction to roleplay as that character, and then sending that combined assembled prompt to the model/backend API as you would with any other text. Unfortunately the documentation isn’t particularly transparent about how the extra text is included (with regards to the exact formatting used, what order things appear in, etc.) and neither do the logs produced by e.g. text-generation-webui include the actual raw prompt as seen by the model.

I’m aware I’m blindly walking into this space with my arms out trying to find the walls; aware, but unworried about giant potential holes in the floor.

The key point to understand here is that all current LLMs (this may change in the future) work only with raw text. They take in some text and then generate other text that goes after it. Any more complex applications such as conversation are just layers built on top of this. The conversation is turned into a plain-text transcript that is sent to the model. The model generates the next part of the conversation transcript, which is then parsed back out and appended to the list of conversation messages. From the model’s perspective, it’s all just one continuous stream of raw text. You can always achieve exactly the same results by manually constructing the same prompt yourself and passing it directly to the model.
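To make the "transcript in, transcript out" idea concrete, here is a minimal sketch of that layer. The heading strings, the function names and the message dict layout are my own assumptions for illustration, not what any particular frontend does internally, and the model output is a hard-coded stand-in rather than a real backend call.

```python
from typing import Dict, List

# Assumed headings for illustration; use whatever format your model expects.
USER_HEADING = "### User:"
BOT_HEADING = "### Response:"


def build_transcript(messages: List[Dict[str, str]]) -> str:
    """Flatten a list of chat messages into one plain-text prompt.

    The prompt ends with an open bot heading so that whatever the model
    writes next becomes the next reply.
    """
    parts = []
    for message in messages:
        heading = USER_HEADING if message["role"] == "user" else BOT_HEADING
        parts.append(f"{heading}\n{message['content']}\n\n")
    parts.append(f"{BOT_HEADING}\n")
    return "".join(parts)


def parse_reply(generated: str) -> str:
    """Keep only the bot's turn, cutting at the next user heading if the model wrote one."""
    return generated.split(USER_HEADING)[0].strip()


messages = [{"role": "user", "content": "Hello, who are you?"}]
prompt = build_transcript(messages)

# Stand-in for the raw text a model might return for this prompt.
raw_output = "I am an assistant.\n\n### User:\nThanks"
messages.append({"role": "bot", "content": parse_reply(raw_output)})
```

On the next turn you would append the new user message and call build_transcript() again, so the model always sees the whole conversation as one string.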

For example, if I pass the following string as the prompt into model.create_completion() from above

"### User:\nPlease can you write a program in Python that will split a file into 19200-byte blocks and calculate the SHA256 hash of each block.\n\n### Response:\n"

I will get exactly the same result as if I used instruct mode in text-generation-webui with ### User: as the user string, ### Response: as the bot string, and <|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n as the turn template, and then sent the message “Please can you write a program in Python that will split a file into 19200-byte blocks and calculate the SHA256 hash of each block.” in the chat box.

(Although imo doing it the manual way is less error-prone and guarantees that the model gets exactly the prompt I intend. text-generation-webui doesn’t give me any way to verify the prompt actually seen by the model, and it’s not as though I haven’t encountered UI bugs before where the produced formatting doesn’t match what I entered…)
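If you want to check the equivalence for yourself, the template expansion can be sketched in a few lines. This is my reading of how the placeholders are filled in, not text-generation-webui's actual code, and render_instruct_prompt is a hypothetical helper name.

```python
def render_instruct_prompt(turn_template: str, user_string: str,
                           bot_string: str, user_message: str) -> str:
    """Expand a turn template for one user message.

    Everything from <|bot-message|> onwards is dropped, because that is
    the part the model is supposed to write.
    """
    prefix = turn_template.split("<|bot-message|>")[0]
    return (
        prefix
        .replace("<|user|>", user_string)
        .replace("<|bot|>", bot_string)
        .replace("<|user-message|>", user_message)
    )


turn_template = "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
message = ("Please can you write a program in Python that will split a file "
           "into 19200-byte blocks and calculate the SHA256 hash of each block.")

rendered = render_instruct_prompt(turn_template, "### User:", "### Response:", message)
manual = "### User:\n" + message + "\n\n### Response:\n"
```

Under these assumptions rendered and manual are the same string, which is the whole point: the template is just a recipe for building the text.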

Like if the reply changes writing perspective context arbitrarily, I need to recall the last question, alter it, and regenerate.

You don’t necessarily need to alter your question in that case, often just regenerating is enough to “fix” this. This is, as I have said, particularly an issue with the LLaMa 2 non-chat models as they aren’t specifically trained to follow a conversation, so sometimes they will arbitrarily decide to provide a commentary or reaction to the conversation or they see the conversation as part of a webpage and try to generate a heading for the next part of an article or some other such seemingly-“random” behavior instead of continuing the conversation itself. If that happens just regenerate the response until the RNG works out in your favor and the model starts writing in the correct role. Once it starts writing a particular “type” of output it will generally keep writing in the same role until it has finished.

Sometimes it is also helpful to write the first part of the response yourself. For example, you could write “Sure! Here is a program that does <summary>” (try to copy the particular style used by a particular model) and then let the model continue from there (there’s an option in text-generation-webui labeled “Start reply with” that does this, or if you’re constructing the prompt yourself then this is trivial to accomplish - make sure to not include a space or newline after the part that you’ve written). This will make it more likely to write a program for you instead of providing a commentary like “The user has asked the assistant to write a program. It is possible that someone may respond to such a request by …”.
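When constructing the prompt yourself, starting the reply is just string concatenation. A minimal sketch (the helper name and the example strings are made up for illustration, and the continuation is a stand-in for real model output):

```python
def prompt_with_reply_start(prompt: str, reply_start: str) -> str:
    """Append the hand-written start of the reply to the prompt.

    Trailing spaces/newlines are stripped from the reply start so the
    model continues directly from the last word.
    """
    return prompt + reply_start.rstrip()


base_prompt = "### User:\nPlease write a program that prints the date.\n\n### Response:\n"
reply_start = "Sure! Here is a program that prints the date: \n"

full_prompt = prompt_with_reply_start(base_prompt, reply_start)

# Stand-in for whatever the model generates after full_prompt.
continuation = "\n\nimport datetime\nprint(datetime.date.today())"

# The reply shown to the user is your hand-written start plus the continuation.
full_reply = reply_start.rstrip() + continuation
```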

If the reply is the same, I know the context tokens are ruined.

This seems to be (sort of) a known issue with LLaMa 2 specifically, where it will keep regenerating the previous response even though you continue the conversation. It’s not exactly clear what causes this, but it’s not a software bug in the traditional sense. The model is receiving your follow-up message; it’s just deciding to repeat whatever it said last time instead of saying something different. This is believed to possibly be an issue with how the training data was formatted.

This might make more sense if you think of this in terms of what the model is seeing. The model is seeing something such as the following:

### User:
Please can you write a program in Python that will split a file into 19200-byte blocks and calculate the SHA256 hash of each block. The hash should be written to a file with the name ".blockhashes." (index is padded to 5 digits).

### Response:
Certainly! Here's an example program that does what you described:

[33-line code snippet removed]

This program takes two arguments: the input file and the output directory. It first calculates the number of blocks needed to store the entire file, and then loops over each block, reading it from the input file and calculating its SHA256 hash. The hash is written to a separate file with the format `.blockhashes.`.

I hope this helps! Let me know if you have any questions or need further clarification.

### User:
Please can you fix the following two issues with your program:

* The output filename must have the block index padded to 5 digits.

* The output file must contain only the SHA256 hash in hex form and no other text/contents.

Please write out only the parts of the program that you have changed.

### Response:

At this point, the model sees the heading ### Response:. For some reason, the LLaMa 2 models have a strong tendency to look back through the text and see that last time the text ### Response: was followed by the text Certainly! Here's an example program that does what you described:. They will then repeat that exact same text again, because the model has concluded that ### Response: should now always be followed by Certainly! Here's an example program that does what you described: instead of seeing the higher-level view where ### User: and ### Response: are taking turns in a conversation.

If this happens, you don’t always need to clear/reset the conversation. Often, you can just regenerate it a few times and once the model starts writing a different response it will continue into something else other than repeating the same text as before. As with the previous point it can also help if you write the first part of the response yourself to force it to say something different.
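The "just regenerate a few times" approach can be automated with a simple check on how the new reply starts. This is only a sketch: regenerate_until_different is a hypothetical helper, the 40-character comparison is an arbitrary choice of mine, and the generate callable here returns canned strings instead of calling a model.

```python
from typing import Callable, Optional


def regenerate_until_different(generate: Callable[[str], str], prompt: str,
                               previous_reply: str, max_attempts: int = 5,
                               compare_chars: int = 40) -> Optional[str]:
    """Call generate() until the reply no longer starts like the previous one.

    Returns None if every attempt begins with the same text.
    """
    previous_start = previous_reply[:compare_chars]
    for _ in range(max_attempts):
        reply = generate(prompt)
        if reply[:compare_chars] != previous_start:
            return reply
    return None


repeated = "Certainly! Here's an example program that does what you described:"

# Stand-in for a real model call: returns canned replies in order.
canned = iter([repeated, repeated, "Here are the two changed parts of the program:"])

reply = regenerate_until_different(lambda prompt: next(canned),
                                   "...full conversation transcript...",
                                   previous_reply=repeated)
```

With a real model this only makes sense when sampling is non-deterministic, otherwise every attempt gives the same text.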

micheal65536@lemmy.micheal65536.duckdns.org · edit-2 2 years ago

I haven’t got any experience with the 70B version specifically but based on my experience with LLaMa 2 13B (still annoyed that there’s no 30B version of v2…) it is more sensitive to prompting variations than other models as it isn’t specifically trained for “chat”, “instruct”, or “completion” style interactions. It is capable of all three but without using a clear prompt and template it can be somewhat random as to what kind of response you will get.

For example, using

### User:
Please write an article about [subject].

### Response:

as the prompt will get results varying from a written article to “The user’s response to an article about [subject] is” to “My response to this request is to ask the user about [clarifying questions]” to “One possible counterargument to an article about [subject] is” to literally the text “Generating response, please wait… [random URL]”. Whereas most conversationally-fine-tuned models will understand and follow this template or other similar templates and play their side of the conversation even if it doesn’t match exactly what they were trained on.

I would recommend using llama.cpp (or the Python binding) directly for more awareness of and control over the exact prompt text as seen by the model. Or using text-generation-webui in “notebook” mode (which just gives you a blank text box that both you and the LLM will type into and it’s up to you to provide the prompt format). This will also avoid any formatting issues with the chat view in text-generation-webui (again I don’t have any specific experience with LLaMa 2 70B but I have encountered times when models don’t output the markdown code block tags and text-generation-webui will mess up the formatting).

Note that for some reason the chat, instruct, and chat-instruct modes in text-generation-webui are confusingly named. instruct mode does not include an “instruction” (e.g. “Continue the conversation”) before the conversation unless you include one in the conversation template (the conversation template is referred to as “Instruction template” in the UI). chat-instruct mode includes an instruction such as “Continue the conversation by writing a single response for Assistant” before the conversation, followed by the conversation template. chat and chat-instruct modes also include text that describes the character that the model will speak as (mostly used for roleplay, but the default “None” character describes a generic AI assistant character). It is possible that the inclusion of this text is what is helping LLaMa 2 stay on track in your case and understand that it is participating in a conversation. I’m not sure what conversation template chat mode uses, but afaik it is not the same turn template as set in instruct and chat-instruct modes and I don’t see an option to configure it anywhere.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

AFAIK supposedly GGUF is a more extensible format that contains (or can contain) more metadata types that make it usable for different model architectures. The main advantage is that this should be the last breaking format change, as future changes can be added in a more modular way.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I have also encountered “rate limits” where the request is not dropped/errored out but is simply stalled until the timeout expires.

Usually this happens in a client library though, rather than over the network itself: the library blocks the thread until it knows that the rate limit is due to expire before issuing the request to the server (and then blocks and reissues again if the server still returns a rate-limit error). This allows the application developer to know that their request will complete “at some point” rather than having to handle the error and timeout themselves. Usually this is preferred in a single-threaded application, or one where all the API stuff happens on a single thread (i.e. one request at a time, no new request is issued until the previous request has completed).
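Roughly, the "block and reissue" part of such a library looks like the sketch below. This is not any specific library's code: blocking_request, the fixed retry_after delay and the (status, body) return shape are all assumptions for illustration, and the fake server returns canned responses so nothing actually sleeps.

```python
import time
from typing import Callable, Tuple

RATE_LIMITED = 429  # HTTP status code for "Too Many Requests"


def blocking_request(send: Callable[[], Tuple[int, str]],
                     retry_after: float = 1.0,
                     max_attempts: int = 5,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call send() until it stops reporting a rate limit, sleeping in between.

    From the caller's point of view the request simply takes a long time.
    """
    for _ in range(max_attempts):
        status, body = send()
        if status != RATE_LIMITED:
            return body
        sleep(retry_after)  # the calling thread is blocked here
    raise TimeoutError("still rate-limited after max_attempts")


# Fake server: rate-limits twice, then answers.
responses = iter([(429, ""), (429, ""), (200, "ok")])
sleeps = []  # records the requested delays instead of really sleeping

result = blocking_request(lambda: next(responses), retry_after=1.0, sleep=sleeps.append)
```

A real client would normally take the delay from the server's response rather than using a fixed value.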

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

More generally, make sure that you have the correct template format selected in the chat settings when you’re using a conversational model.

Some models supposedly require an additional “instruction” template where the “instruction” is something like "Continue the following conversation between and by writing a single reply for " although personally I get better results without this even on models that are instruction-tuned rather than conversation-tuned. Most models that have any form of basic tuning beyond a bare “continue/complete the text” model (which requires an entirely different approach to prompting) seem to be able to understand the basic format/concept of a conversation.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

How would you ask for a follow-up change using this instruction template?

Personally I interpreted the request as “if it’s been less than 2 minutes, sleep/block until it’s 2 minutes since last time” rather than dropping/discarding the string immediately and continuing. Suppose this is what I had actually wanted, can you ask the model to modify its code accordingly without having to go back and edit the original prompt to start over?

I find that a lot of programming questions require multiple rounds of refinements. I tend to favor models that are able to modify existing code in a back-and-forth discussion, and that are capable of writing out just the modified parts of their code with each change to save on time and token count (seriously, so many models will insist on repeating the entire thing no matter how firmly you tell them not to - if you’re lucky, they’ll actually include the changes in their second reply instead of thereafter getting stuck in a loop of writing out identical code every time).

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

No ML necessary. I’ve done similar jobs like this before, and the workflow is roughly:

1. Identify and crop each photo from the sheet. In my case I did this manually but it should be easy to do with OpenCV (can’t help you there as I’ve never done anything with OpenCV myself). Of course you know the expected resolution and aspect ratio, so you could flag any photos that fall outside of this as “not straight” or “not 6x4” so that you can go back and check what’s up with them. You definitely want to make sure that all your output images are the exact same size (in pixels), which should be the DPI of your scanner multiplied by the size (in inches) of your photos.
2. For black and white photos: Remove any saturation so that they are “true” black and white and don’t have a color tint.
3. For color photos: Do automatic white balance adjustment. If the photos have a strong color tint due to age (e.g. pink) then you may need to tweak it manually (or use software/algorithm/plugin intended specifically to correct this tint).
4. Perform automatic exposure correction.
5. After this, apply any upscaling and noise/damage removal that you want.
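The size check from the cropping step is simple arithmetic, so here is a minimal sketch of it. The function names, the 3% tolerance and the 600 DPI example are my own assumptions; it only looks at the crop's pixel dimensions and does not touch the image data.

```python
from typing import List, Tuple


def expected_pixels(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Expected output size in pixels: scanner DPI multiplied by the print size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


def flag_crop(crop_w: int, crop_h: int, dpi: int,
              width_in: float = 6.0, height_in: float = 4.0,
              tolerance: float = 0.03) -> List[str]:
    """Return a list of problems with a cropped photo (empty if it looks fine).

    Landscape and portrait orientation are both accepted.
    """
    exp_long, exp_short = expected_pixels(max(width_in, height_in),
                                          min(width_in, height_in), dpi)
    long_side, short_side = max(crop_w, crop_h), min(crop_w, crop_h)

    flags = []
    expected_ratio = exp_long / exp_short
    if abs(long_side / short_side - expected_ratio) > tolerance * expected_ratio:
        flags.append("wrong aspect ratio")
    if (abs(long_side - exp_long) > tolerance * exp_long
            or abs(short_side - exp_short) > tolerance * exp_short):
        flags.append("wrong size")
    return flags


target = expected_pixels(6.0, 4.0, 600)   # 6x4 inch print scanned at 600 DPI
good = flag_crop(3610, 2395, dpi=600)     # a few pixels off, within tolerance
bad = flag_crop(3600, 2700, dpi=600)      # too tall for a 6x4 print
```

Anything that comes back with flags is worth opening by hand before it goes through the later steps.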

I recommend saving your raw sheets, your processed output after step 4, and your final output after step 5. Step 5 is the one to look at or give to people. Step 4 is the one to go back to if you discover a better damage removal algorithm later on. The raw sheets is the one to go back to if you later discover a mistake in your cropping, or if you want to redo the color processing, or anything else that you’ve done to the images after they were scanned. You don’t need to save the outputs of the other steps.

For steps 3 and 4 you should use software that is designed for photographers who are processing digital photos. e.g. Adobe Lightroom, RawTherapee, Darktable. These programs will allow you to open a scanned photo and process it the exact same way that you would process a raw photo taken by a digital camera if you were a photographer. If you aren’t a photographer then you can use the “auto” settings and the program will do (roughly) the same as what your smartphone camera does to the raw image data when you take a photo today (that is, make the lighting and color balance look “correct”).

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I haven’t tried it so maybe it doesn’t work but for me I have the option to add a second account.

In the drawer tap the “header” area where your username is. This part has a different color background to the main part of the drawer menu. Three options will appear: “Add an account”, “Anonymous”, and “Log out”.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

At least (as far as I can tell) they appear to be ranking the models by human evaluation rather than “benchmarks”, which is closer to measuring the real-world performance.

It would be interesting to consider the types of questions that users are posing. For example there is a difference between asking:

* A surface-level fact-based question such as “what is …”
* A creative question like “write a story/article about …” or “give me a list of possible talking points for a presentation on …”
* A question about reasoning/understanding like “why do you think the word … is more popular than … when referring to …” or “explain why … is considered socially acceptable while … is not”
* Anything coding-related

Also, some models seem to do well at things that can be answered after one or two replies, but struggle to follow an argument if you try to go more in-depth or continue a conversation about a topic.

micheal65536@lemmy.micheal65536.duckdns.org · edit-2 2 years ago

Yes, we are talking about the same thing.

The setting that I am referring to is at “Settings > Post History > Mark Posts as Read”.

EDIT: Note that this will not affect posts that already appear this way, as previously explained. The only difference is that when you open an “unread” post, it will no longer change to being “read”. Existing posts are unaffected. If you use another client, the posts that you read on there may still appear as “read” in Infinity.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I’m pretty sure this is supposed to be determined by the “mark posts as read” setting. However, this only determines whether or not Infinity will mark posts as read through the API. Posts that have been marked as read by another client or the web interface are still displayed with the different colors. There should be a separate setting to ignore the “read” status entirely (and then posts that have been marked as read by a different client won’t be greyed or hidden) but there isn’t. There also doesn’t appear to be a way to mark posts as “unread” again.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I’ve primarily used WizardLM as well but I’ve found that it tends to constantly try to follow the same format for every answer:

Not only is this repetitive, boring, and belittling to converse with, but it means that the model often won’t directly answer a question or give an actual argument/justification for something. It feels vaguely like it’s refusing to commit to a side and telling me off for trying to talk in absolutes rather than actually giving an answer.

Additionally, in cases where there isn’t a counterargument to be made, it will make up nonsense to fill the counterargument section. e.g. “Explain your reasoning for the above answer” tends to result in:

<“You can arrive at the above answer by doing …” followed by mostly sensible reasoning>

<“Alternatively, you could do …” followed by either a made up illogical reasoning or the exact same reasoning as before presented as if it was a different thing>

When I can get it to break out of this pattern, e.g. following the “thought action observation” loop script, it seems to perform marginally better than other models that I have tried.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I have also tried to generate code using deterministic sampling (always pick the token with the highest probability). I didn’t notice any appreciable improvement.
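For anyone unsure what "deterministic" means here, this is the difference in a toy form. It is a sketch of the idea with made-up logits and my own function names, not the sampling code of any particular backend.

```python
import math
import random
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn raw scores into probabilities, after scaling by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(logits: List[float]) -> int:
    """Deterministic: always the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def pick_sampled(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Non-deterministic: draw a token index according to the probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [1.0, 3.5, 2.0]           # made-up scores for a 3-token vocabulary
rng = random.Random(0)

greedy_choice = pick_greedy(logits)                 # same answer on every call
sampled_choice = pick_sampled(logits, 0.8, rng)     # can differ between calls
```

With greedy picking, regenerating is pointless because the same prompt always gives the same output.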

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I have a similar list of prompts/test cases that I use.

However, my experience has been that all fine-tuned LLaMa models give pretty much the same results. I haven’t actually found a model that passes any of my “test cases” that others have failed (additionally, none until OpenOrca preview 2 had failed a test case that others had passed). All the models feel pretty much the same in terms of actual abilities, and the only noticeable difference is that they give their answers in a slightly different way.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

Yeah, I’m aware of how sampling and prompt format affect models. I always try to use the correct prompt format (although sometimes there are contradictions between what the documentation says and what the preset for the model in text-generation-webui says, in which case I often try both with no noticeable difference in results). For sampling I normally use the llama-cpp-python defaults and give the model a few attempts to answer the question (regenerate), sometimes I try it on a deterministic setting.

I wasn’t aware that the benchmarks are multi-shot. I haven’t looked so much into how the benchmarks are actually performed, tbh. But this is useful to know for comparison.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I see your point and we are currently at the “trying to look good on benchmarks” stage with LLMs, but my concern/frustration at the moment is that this is actually hindering real progress, because researchers/developers are looking at the benchmarks and saying “it’s X percentage, this is a big improvement” while ignoring real-world performance.

Questions like “how important is the parameter count” (I think it is more important than people are currently acknowledging) are being left unanswered because meanwhile people are saying “here’s a 13B parameter model that scores X percentage compared to GPT-3” as if to imply that smaller = better even though this may be impeding the model’s actual reasoning ability compared to learning patterns that score well on benchmarks. And new training methods are being developed (see: Evol-Instruct, Orca) through benchmark comparisons and not with consideration of their real-world performance.

I get that benchmarks are an important and useful tool, and I get that performing well on them is a motivating factor in an emerging and competitive industry. But I can’t accept such an immediately-noticeable decline in real-world performance (model literally craps itself) compared to previous models while simultaneously bragging about how outstanding the benchmark performance is.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I haven’t tried that one yet because it seemed like an older model with a less refined dataset but I will put that one in the queue as the next model to download and try out.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

What is wrong with LLM benchmarks, and why are we still using them?

micheal65536@lemmy.micheal65536.duckdns.org · edit-2 2 years ago

I am getting very poor results with this model. Its coding ability is noticeably worse than LLaMa 2. It will readily produce output that claims to be following a logical progression of steps, but often the actual answer is not consistent with the logic or the steps themselves are in fact not correct or logical.

Curious to know if other people who have tried it are getting the same results.