Artificial intelligence models simulate human perception: has the age of conscious machines begun?

Understanding how the human brain works and building machines that can simulate perception and intuition has long been one of humanity's deepest ambitions. While AI models have come a long way in data processing and language, a large gap remains in their ability to understand the physical world intuitively, something that babies acquire easily through observation. Recent research suggests, however, that this gap is narrowing rapidly, with the development of models capable of showing surprise when the rules of physics are broken before them, that is, when impossible events occur. The V-JEPA model represents an important step toward giving artificial intelligence an innate understanding of the world, promising a revolution in robotics and autonomous vehicles.

Artificial intelligence simulates the perception of babies:

In a pioneering scientific step, Meta researchers have developed an artificial intelligence model that demonstrates an understanding of the basic physical principles of the world, the kind of innate intuition that babies acquire through observation. The model, called V-JEPA, can be surprised when faced with physically impossible events, such as an object disappearing for no reason, mirroring the reaction of six-month-old infants to violations of object permanence. V-JEPA does not rely on pre-programmed physical rules; it learns by watching millions of videos, just as human minds learn through experience. According to Meta's tests, the model predicts what will happen in videos based on latent representations: abstract layers that reduce thousands of pixels to the essential information about objects, their movement, and their location.
If the upcoming scene contradicts its predictions, a large prediction error appears, similar to the feeling of surprise in infants. But how does V-JEPA's understanding of a scene differ from that of traditional models?

Artificial intelligence engineers, especially those developing autonomous driving systems, face a fundamental challenge: enabling a machine to understand the visual world with reliability comparable to human perception. Systems designed to analyze video content, both to classify it and to outline the contours of surrounding objects, have long relied on so-called pixel space. In this space, every color point (pixel) in the scene is handled with equal weight, much as if the brain received all sensory inputs without filtering or prioritizing. Although effective in some contexts, this approach suffers from a cognitive blind spot. Imagine a complex scene of a street full of cars and traffic lights: if the model insists on processing subtle, non-essential details, such as the movement of leaves or the play of shadows, it risks missing the most important data, such as the color of the traffic light or the exact positions of neighboring cars. As the researchers explain, working in pixel space means dealing with an enormous amount of detail that does not necessarily need to be modeled, which hinders efficiency and the ability to make quick, informed decisions. To address this shortcoming, Meta developed the Video Joint Embedding Predictive Architecture, known as V-JEPA, which it launched in 2024, to simulate an essential part of the human cognitive process: selective abstraction.
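The scale of the problem the paragraph above describes can be made concrete with a toy sketch (this is an illustration, not Meta's code; the frame resolution and latent size are hypothetical choices):

```python
import numpy as np

# Toy illustration: compare the amount of data a model must handle in
# raw pixel space versus a compact latent representation of the scene.

# A single 1080p RGB video frame in pixel space:
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
pixel_values = frame.size  # every color point is weighted equally

# A hypothetical latent vector summarizing the same scene
# (object identities, positions, motion) in a few hundred numbers:
latent = np.zeros(512, dtype=np.float32)

reduction = pixel_values / latent.size
print(f"pixel space: {pixel_values:,} values")
print(f"latent space: {latent.size} values ({reduction:,.0f}x smaller)")
```

A model reasoning over the 512-value summary can ignore leaf movement and shadow contrast entirely, because those details were never encoded in the first place.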
While traditional models mask parts of video frames and train the network to predict the value of the missing pixels, V-JEPA takes a radically different path. It uses the same masking process, but instead of predicting what lies behind the mask at the pixel level, it predicts content at a higher level of abstraction, known as latent representations, the philosophical and technical essence that simulates human perception. The model is built on an encoder that converts frames into a small set of numerical values representing the intrinsic features of the scene: the shape of an object, its dimensions, its location, its movement, and the relationships between elements. Instead of thousands of pixels, the system deals only with the essence of the scene, just as the brain processes visual input by ignoring noise and focusing on useful information. Quentin Garrido, a research scientist at Meta, emphasizes that the core strength of this model lies in its ability to filter data: "This mechanism allows the model to drop impurities and unnecessary details, focusing instead on the most fundamental and important aspects of the depicted scene. The efficient disposal of excess information is a central goal that V-JEPA seeks to achieve with maximum efficiency." This shift from modeling pixels to modeling meaning gives V-JEPA strong generalization, high accuracy in understanding new scenes, and remarkable efficiency in complex environments such as autonomous driving and robotics. Its role is thus not limited to seeing the world but extends to understanding it, a deep step toward something like human perception, which opens the door to the question of how close we are to the era of "conscious machines."
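The mask-and-predict idea can be sketched in a few lines. This is a minimal toy, not V-JEPA's actual architecture: the linear encoder, the patch sizes, and the mean-of-context predictor are all stand-in assumptions chosen only to show where the loss is computed.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(patch, W):
    # Toy linear encoder: maps a flattened pixel patch to a small latent vector.
    return np.tanh(W @ patch)

# Hypothetical shapes: a 16x16 RGB patch (768 pixels) -> a 32-dim latent.
patch_dim, latent_dim = 16 * 16 * 3, 32
W = rng.normal(scale=0.01, size=(latent_dim, patch_dim))

frame_patches = rng.random((10, patch_dim))  # 10 patches of one frame
mask = np.zeros(10, dtype=bool)
mask[3:6] = True                             # hide patches 3..5

# Pixel-space training would predict the 768 raw pixels of each masked patch.
# JEPA-style training instead predicts only their *latent* encodings:
targets = np.stack([encoder(p, W) for p in frame_patches[mask]])

def predictor(context_latents):
    # Toy predictor: guesses each masked latent from the visible context
    # (here, simply the mean of the visible patches' latents).
    return np.tile(context_latents.mean(axis=0), (mask.sum(), 1))

context = np.stack([encoder(p, W) for p in frame_patches[~mask]])
predictions = predictor(context)

# The loss lives in latent space: 32 numbers per patch instead of 768 pixels.
latent_loss = np.mean((predictions - targets) ** 2)
print(f"latent-space prediction error: {latent_loss:.6f}")
```

The key design point the article describes is visible in the last line: the error never touches pixels, so the model is never penalized for failing to reproduce noise it rightly discarded.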
Simulating intuition: when a model is as surprised as a human being

Last February, the V-JEPA team revealed remarkable results on the IntPhys benchmark, designed to measure the ability of AI models to distinguish between physically possible and impossible events in videos. The model achieved an accuracy of almost 98%, far surpassing traditional vision models that predict in pixel space, which barely exceeded the level of random guessing. This result reflects not just a technical improvement but a qualitative transition from visual recognition to a deeper, contextual understanding of the world. The researchers did not stop at measuring prediction accuracy; they went a step further and measured what might be called the model's degree of surprise. They mathematically computed the difference between what V-JEPA expects to happen in future frames and what actually happens. When scenes contained explicit violations of the laws of physics, such as a ball disappearing behind a barrier and never reappearing, the prediction error rose sharply, in a response strikingly similar to the intuitive reaction of infants when their innate rules about the world are violated. In other words, V-JEPA appeared to be surprised by what it saw. This feature matters because it shows that the model does not merely memorize patterns but builds internal forecasts of how the world will behave, and exhibits a clear computational signal when those forecasts are violated. However, some scientists do not consider the journey complete. Karl Friston, a computational neuroscientist at University College London, believes that V-JEPA is on track to simulate how our human brains learn and build their perceptions of the world.
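The "degree of surprise" described above reduces to a simple computation: the distance between the predicted and the observed latent state of future frames. The sketch below is illustrative; the three-dimensional latents and the ball-behind-a-barrier values are invented for the example.

```python
import numpy as np

def surprise(predicted_latent, observed_latent):
    # "Surprise" as the prediction error between what the model expected
    # the next frames to encode and what it actually observed, both in
    # latent space (mean squared difference).
    return float(np.mean((predicted_latent - observed_latent) ** 2))

# Hypothetical latents for a ball rolling behind a barrier:
predicted = np.array([1.0, 0.5, -0.2])    # model expects the ball to reappear
plausible = np.array([0.9, 0.6, -0.1])    # it does: error stays small
impossible = np.array([-2.0, 3.0, 1.5])   # it vanishes: error spikes

print("plausible scene:", surprise(predicted, plausible))
print("impossible scene:", surprise(predicted, impossible))
```

A sharp spike in this quantity on physically impossible clips is what the researchers read as the computational analogue of an infant's surprise.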
At the same time, he stresses that this progress remains incomplete, as the model lacks some fundamental elements, first among them the absence of a clear mechanism for representing uncertainty. When the information in previous frames is insufficient to foresee what will happen next, prediction becomes inherently uncertain. V-JEPA does not yet have the ability to quantify or express this ambiguity, an essential element of any model that aspires to accurately simulate human cognition. When available information is insufficient to predict the future, a person recognizes the limits of their knowledge and treats the situation as uncertain, while the model still lacks an explicit mechanism for measuring and expressing that uncertainty.

Last June, the team moved to a more ambitious level with the launch of V-JEPA 2, a model with 1.2 billion parameters trained on 22 million videos. The experiments were not limited to vision but extended to robotics: the model was fine-tuned on a relatively small amount of robot data and then used to plan subsequent actions in simple manipulation tasks. This step points to an emerging bridge between understanding and acting, one of the fundamental pillars of anything resembling conscious perception. However, newer tests such as IntPhys 2 reveal clear limits. On longer and more complex scenarios, the models only slightly outperformed chance. This is partly due to the model's limited temporal memory: it can only handle a few seconds of video before it forgets what came before. Such a short memory puts a ceiling on anything like continuous perception, let alone integrated consciousness.

Are we really on the cusp of the age of conscious machines?
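The missing uncertainty mechanism the passage describes has well-known (if imperfect) workarounds elsewhere in machine learning; one common idea is to run an ensemble of predictors and read their disagreement as uncertainty. The sketch below illustrates that general idea only; it is not part of V-JEPA, and the toy predictors are invented for the example.

```python
import numpy as np

def ensemble_predict(context, predictors):
    # Each predictor gives its own forecast of the next latent state;
    # the variance across predictors serves as an uncertainty signal.
    forecasts = np.stack([p(context) for p in predictors])
    return forecasts.mean(axis=0), float(forecasts.var(axis=0).mean())

# Toy predictors that agree when the context is informative (low spread)
# and diverge when it is ambiguous (high spread).
predictors = [lambda c, b=b: c.mean() + b * c.std() for b in (-0.5, 0.0, 0.5)]

informative = np.array([1.0, 1.0, 1.0, 1.0])  # constant context -> agreement
ambiguous = np.array([0.0, 2.0, -1.0, 3.0])   # noisy context -> disagreement

_, u_low = ensemble_predict(informative, predictors)
_, u_high = ensemble_predict(ambiguous, predictors)
print(f"uncertainty (informative): {u_low:.3f}")
print(f"uncertainty (ambiguous):  {u_high:.3f}")
```

A model equipped with such a signal could, in Friston's terms, recognize the limits of its own knowledge: it would know not only what it predicts, but how little it trusts that prediction.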
What models such as V-JEPA show is not the birth of artificial consciousness in the strict philosophical sense, but it is strong evidence that artificial intelligence is accelerating toward simulating the fundamental pillars of human cognition: building logical expectations, detecting what violates them, and learning from surprise. This represents a major step toward machines capable of understanding the world as an interconnected system rather than a mere series of visual inputs. Yet these models remain far from possessing subjective experience or genuine human consciousness. The question raised today is therefore no longer whether this path can be pursued, but how far the simulation can go, and what limits of consciousness a machine may one day touch.
