Pre-deployment testing and capability assessments can't predict how AI systems actually behave in the wild. Here's a better framework.