Salma Mayorquin

salma-remyx


Recent Activity

posted an update about 6 hours ago
VQASynth is the open-source implementation of the https://huggingface.co/papers/2401.12168 paper, putting together the data synthesis pipeline behind https://huggingface.co/remyxai/SpaceQwen2.5-VL-3B-Instruct, https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B, and several other spatial reasoning models we've shared here on HF. Here's how we use Remyx AI to build and improve VQASynth from the original concept forward.

Stage 1: When you connect a repo to Remyx, we extract development milestones from the commit history. For VQASynth, that surfaces the moments we changed how scenes get parsed, how captions get generated, and how spatial relations get encoded. Those milestones power personalized recommendations for methods semantically relevant to improving your system.

Stage 2: Once the model is serving in production, that same commit history delineates changes so you can learn from quasi-experiments over observational outcomes. This generates causal evidence about which changes drove which outcomes, sharpens recommendations, and supports inference on questions you haven't directly tested.

Stage 3: Once teams are running controlled experiments, the intervention outcomes tighten those estimates further.

Stage 4: When A/B testing becomes the operational bottleneck, we instrument decision points in the production system to explore via counterfactual perturbations: initially in shadow mode, and after passing audits, with live traffic.

Rough code sketches of the milestone extraction, quasi-experiment, and shadow-mode ideas follow below this post.

If you want recommendations tuned to your own project context, you can set up a feed here: https://docs.remyx.ai/platform/discover/feed
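Here's a minimal sketch of the Stage 1 idea: flag candidate milestones as the commits that touch core pipeline modules. The module paths and the path-based filtering heuristic are illustrative assumptions, not the production implementation:

```python
# Sketch: surface candidate "milestones" from a repo's commit history by
# flagging commits that modify core pipeline modules. The paths below are
# hypothetical stand-ins for where scene parsing, captioning, and spatial
# relation encoding might live.
import subprocess

PIPELINE_PATHS = ["vqasynth/scene", "vqasynth/captions", "vqasynth/relations"]

def candidate_milestones(repo_dir: str) -> list[dict]:
    """Return commits that touch core pipeline code, oldest first."""
    log = subprocess.run(
        ["git", "-C", repo_dir, "log", "--reverse",
         "--pretty=format:%H|%ad|%s", "--date=short", "--", *PIPELINE_PATHS],
        capture_output=True, text=True, check=True,
    ).stdout
    milestones = []
    for line in log.splitlines():
        sha, date, subject = line.split("|", 2)
        milestones.append({"sha": sha, "date": date, "subject": subject})
    return milestones

if __name__ == "__main__":
    for m in candidate_milestones("."):
        print(m["date"], m["sha"][:8], m["subject"])
```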
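And a minimal sketch of the Stage 2 idea: treat a deploy commit's timestamp as a cutover and compare an outcome metric before and after it. The file, column names, and metric are illustrative; a real analysis would also adjust for trends and confounders (e.g. an interrupted time series design) rather than a raw difference in means:

```python
# Sketch: a first-pass quasi-experiment over observational outcomes.
# A commit-delineated deploy time splits production logs into before/after,
# and we estimate the effect as a difference in means with a Welch t-test.
# "outcomes.csv" and "spatial_accuracy" are hypothetical.
import pandas as pd
from scipy import stats

outcomes = pd.read_csv("outcomes.csv", parse_dates=["timestamp"])
deploy_time = pd.Timestamp("2025-06-01 12:00")  # assumes tz-naive timestamps

before = outcomes.loc[outcomes["timestamp"] < deploy_time, "spatial_accuracy"]
after = outcomes.loc[outcomes["timestamp"] >= deploy_time, "spatial_accuracy"]

t, p = stats.ttest_ind(after, before, equal_var=False)  # Welch's t-test
print(f"effect estimate: {after.mean() - before.mean():+.4f} (p={p:.3g})")
```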
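Finally, a sketch of the Stage 4 idea: wrap a decision point so the production policy keeps serving live traffic while a perturbed candidate action is only logged for counterfactual analysis. Every name here is illustrative, not the platform's API:

```python
# Sketch: shadow-mode instrumentation of one decision point. The served
# action always comes from the production policy; the perturbed candidate
# is logged, never served, so it can be analyzed offline.
import json
import random
import time

def decide(features: dict) -> str:
    """Stand-in for the production policy at this decision point."""
    return "baseline_prompt"

def shadow_decide(features: dict, log_path: str = "shadow.log") -> str:
    action = decide(features)  # serves live traffic unchanged
    candidate = random.choice(["baseline_prompt", "cot_prompt"])  # perturbation
    with open(log_path, "a") as f:
        f.write(json.dumps({"ts": time.time(), "features": features,
                            "served": action, "shadow": candidate}) + "\n")
    return action
```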
posted an update 2 days ago
SciCrafter measured something AI practitioners have intuited: frontier agents are improving at executing inside well-framed problems but lag at framing the problem in the first place. GPT-5.2, Gemini-3-Pro, and Claude Opus 4.5 all plateaued near 26% on a new Minecraft benchmark probing AI capabilities in the discovery-to-application loop.

So the authors ran targeted interventions:

* Hints about what to investigate doubled performance.
* A structured experimentation template added 7-14 more points.
* Structured consolidation beat free-form summaries by 6 points.
* Curriculum context beat independent task-solving.

These interventions helped the agent frame what's worth investigating and structure what gets learned so it compounds. The bottleneck for AI in scientific workflows is upstream of execution.

Their findings are congruent with the design patterns we've adopted at Remyx AI to help AI teams close the development loop scientifically. Agents work well inside structured loops but perform poorly when tasked with creating the structure, so instrumenting your scientific workflows offers greater leverage than scaling compute on a less informed search. In building production AI systems, teams fly through execution; the harder problems are identifying which experiments moved which production outcome and what to try next.

One of the more interesting results I found this week by tracking work in AI for scientific workflows with Remyx: https://engine.remyx.ai/papers/d8f23b9b-b14b-4ada-b44e-ccfc221c06b4
updated a dataset 9 days ago
salma-remyx/vqasynth_testing_evals_eval

Organizations

Remyx AI