Fishing for biases: An experimental and theoretical inquiry with Semantic Match
Ludovico Deponte

Abstract: In recent years, breakthroughs in model architecture and training, the wide availability of data, and increased computing power have together allowed AI models not only to improve performance on established research tasks, but also to become useful in everyday life: from translation to image and text generation, the new models are now part of the daily workflow of millions of users. With this widespread adoption, it is paramount to understand the models' workings and behaviour. While the field of eXplainable Artificial Intelligence already offers tools to tackle this problem, some of the most commonly used ones rely on local explanations, which evaluate the dependence of the model's outputs on one or more features of the input, data point by data point. Such methods are especially problematic whenever explanations of single features are grouped together and used to draw analogies with human concepts in order to infer the general behaviour of the model. Such generalizations can in turn lead to erroneous conclusions about the model's behaviour and to confirmation bias. The semantic match framework attempts to address the issue of generalizing from local explanation methods by constructing global hypotheses about the model's behaviour and verifying that the evidence provided by local explanations is consistent with the hypothesis under consideration. If that is the case, the hypothesis matches the behaviour of the model and can be used to give an account of the model's workings. In this thesis, on the one hand, we conduct an experiment to verify whether semantic match can be used to discover a known bias of a model; on the other, we assess the metrics of the framework and compare them to alternatives inspired by previous work.
More specifically, we start by training a biased and an unbiased model on the SQuAD dataset, and check different hypotheses about their functioning against their actual behaviour. We measure the match between a hypothesis and the model's behaviour with two metrics, median distance and area under the curve. After reviewing the experimental results, we examine general trends in these metrics, propose further evaluations based on them, and apply alternative metrics. We close by discussing the advantages and disadvantages of each measure.