A multi-block masking strategy determines which patches the context encoder sees and which ones it must predict. First, three prediction blocks are sampled, each covering 15–20% of the patch grid, with randomized aspect ratios so the model can't learn to expect a fixed shape. Because these blocks may overlap, their union covers roughly 35–50% of the grid rather than the full 45–60% their individual areas would sum to. After that, a large context block (80–100% of the grid) is sampled, and the prediction regions are subtracted from it. The result is that the context encoder typically sees around 40–55% of the patches.
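The sampling procedure above can be sketched as follows. This is a minimal illustration, not a reference implementation: the exact aspect-ratio range (assumed here to be 0.75–1.5 for prediction blocks and square for the context block), the rounding of block dimensions, and the function names are all assumptions.

```python
import random


def sample_block(grid_h, grid_w, scale_range, aspect_range, rng):
    """Sample one rectangular block of patch coordinates.

    scale_range  -- fraction of the grid the block should cover
    aspect_range -- height/width ratio range (randomized per block)
    """
    scale = rng.uniform(*scale_range)
    aspect = rng.uniform(*aspect_range)
    area = scale * grid_h * grid_w
    h = max(1, min(grid_h, round((area * aspect) ** 0.5)))
    w = max(1, min(grid_w, round((area / aspect) ** 0.5)))
    top = rng.randrange(grid_h - h + 1)
    left = rng.randrange(grid_w - w + 1)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def multiblock_mask(grid_h=14, grid_w=14, n_targets=3, seed=0):
    rng = random.Random(seed)
    # Step 1: sample three prediction blocks (15-20% each, randomized
    # aspect ratio). Overlap is allowed, so we take the union.
    targets = set()
    for _ in range(n_targets):
        targets |= sample_block(grid_h, grid_w, (0.15, 0.20), (0.75, 1.5), rng)
    # Step 2: sample a large context block (80-100% of the grid) and
    # subtract the prediction regions from it.
    context = sample_block(grid_h, grid_w, (0.80, 1.00), (1.0, 1.0), rng)
    context -= targets
    return context, targets
```

With a 14×14 grid the returned `context` set lands in the ballpark the text describes, since the ≥80% context block loses its intersection with the ~35–50% target union. The key invariant is that the context and target sets are disjoint, so the encoder never sees a patch it is asked to predict.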