RoPE settings
The text processor seems to be configured separately from the vision encoder. I think this means that the standard RoPE configuration goes under text_config with values similar to Zephyr 7B.
In this case the RoPE configuration goes in config.json, which is where I was applying it. Because the model is limited to only 4k context, YaRN or LongRoPE (developed by Microsoft and teased with the Phi 3 base) is preferable for maintaining coherence under extreme context extension. Applying Zephyr's configuration would only yield 8k context, since it uses a factor of 2.0. Anyway, in my testing the model struggled with tasks longer than about 4-6K context with either method, so I gave up for the time being. That was also when I noticed that llama.cpp has arguments for configuring RoPE directly, which should save time.
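For reference, the context arithmetic here is just the base context multiplied by the scaling factor. A minimal sketch, assuming the nominal extended length is simply base times factor (the function names are made up for illustration and are not part of Transformers or llama.cpp):

```python
def extended_context(base_context: int, factor: float) -> int:
    """Nominal context length after scaling RoPE positions by `factor`."""
    return int(base_context * factor)


def required_factor(base_context: int, target_context: int) -> float:
    """Scaling factor needed to stretch `base_context` to `target_context`."""
    return target_context / base_context


# 4k base context with a factor of 2.0
print(extended_context(4096, 2.0))   # 8192

# Factor needed to reach a 16k target from the same 4k base
print(required_factor(4096, 16384))  # 4.0
```

This only gives the nominal window. Whether the model stays coherent out to that length is a separate question, as the testing above suggests.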
P.S. Transformers was complaining that llava lacked the configuration key max_position_embeddings, or something like that, after I configured YaRN.
The max_position_embeddings figure is 4096, which gives 16384 with a RoPE scaling factor of 4.0. The sliding_window can then be increased to match. See here in your Zephyr config.
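As a rough sketch of what that edit could look like, here is a small script that patches a config dict. The text_config, max_position_embeddings and sliding_window names come from this thread. The rope_scaling layout ("type" and "factor") is an assumption, since the exact keys and supported types vary by Transformers version and model class. The helper name apply_rope_scaling is hypothetical, and the sketch leaves max_position_embeddings itself untouched.

```python
import copy
import json


def apply_rope_scaling(config: dict, factor: float, scaling_type: str = "linear") -> dict:
    """Return a copy of `config` with RoPE scaling set under text_config.

    Assumption: the text model reads a `rope_scaling` dict with `type` and
    `factor` keys. Check this against your Transformers version.
    """
    patched = copy.deepcopy(config)
    text = patched["text_config"]
    # Raises KeyError if the key is missing, similar in spirit to the
    # complaint about max_position_embeddings mentioned above.
    base = text["max_position_embeddings"]
    text["rope_scaling"] = {"type": scaling_type, "factor": factor}
    # Widen the sliding window to match the extended context.
    text["sliding_window"] = int(base * factor)
    return patched


# Hypothetical minimal config; a real config.json has many more keys.
example = {"text_config": {"max_position_embeddings": 4096, "sliding_window": 4096}}
patched = apply_rope_scaling(example, 4.0)
print(json.dumps(patched, indent=2))
```

To use it on a real file, load config.json with json.load, pass the result through the helper, and write it back with json.dump. Keep a backup of the original first.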