Introduction
Checking the similarity of a scientific research project to other projects is the first step in determining whether it is worthy of funding. According to statistics, the duplication rate of research projects in China is 40% (Zhang et al., 2011). Duplicated research projects waste scientific research resources and affect the national scientific and technological layout.
Research project similarity discrimination is a comprehensive technology that draws on natural language processing, knowledge graphs, information retrieval, and other fields. Combining multi-domain knowledge with research project data helps identify existing research projects similar to those under application (in research content, research objects, or research objectives), providing a reference for reviewers and funding agencies. Current research on project similarity discrimination mostly focuses on keyword extraction, text similarity calculation, and project clustering, and ignores the correlations embedded in the data. Deficiencies remain in the model design, accuracy, and query efficiency of existing algorithms.
To address these problems, this paper works with data on completed projects and project results from the National Natural Science Foundation of China. To improve the accuracy of research project similarity discrimination, we propose a method for generating sentence vectors fused with word order (IUFWO), based on an improved version of Unsupervised Random Walk Sentence Embeddings (USIF). The method strengthens the semantic representation ability of USIF by introducing part-of-speech weights and position weights and by integrating word order features into the sentence vectors. Based on IUFWO, this paper designs a new method for calculating research project similarity. It scores the similarity of two projects as the weighted sum of the cosine similarities of their project names, abstracts, keywords, and conclusion summaries, which improves the accuracy of the similarity judgment.
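The weighted-sum scoring described above can be sketched as follows. This is only an illustration under stated assumptions: the field names, the weight values in `FIELD_WEIGHTS`, and the toy vectors are hypothetical, and the sentence vectors are assumed to have been produced already by some embedding method (the IUFWO generation step itself is not shown).

```python
import math

# Hypothetical field weights (assumed values, chosen only so that they sum to 1).
FIELD_WEIGHTS = {"name": 0.3, "abstract": 0.3, "keywords": 0.2, "conclusion": 0.2}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def project_similarity(p, q, weights=FIELD_WEIGHTS):
    """Weighted sum of per-field cosine similarities between two projects.

    p and q map each field name to its precomputed sentence vector.
    """
    return sum(w * cosine(p[field], q[field]) for field, w in weights.items())


# Toy vectors standing in for real sentence embeddings.
project_a = {"name": [1.0, 0.0], "abstract": [0.5, 0.5],
             "keywords": [0.0, 1.0], "conclusion": [1.0, 1.0]}
project_b = {"name": [0.0, 1.0], "abstract": [0.5, 0.5],
             "keywords": [0.0, 1.0], "conclusion": [1.0, -1.0]}

score = project_similarity(project_a, project_b)
```

Because each field contributes separately, a pair of projects with near-identical abstracts but unrelated names still receives a partial score, which is what lets a weighted scheme separate candidates more sharply than a plain average.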
Projects submitted by scholars who cooperate closely are more likely to be similar or duplicated. To improve the query efficiency of similarity discrimination, we extract the project cooperation relationships between scholars and entities and construct a scientific research cooperation network, which serves as the basis for the similarity discrimination algorithm. The algorithm gives priority to checking for duplication among projects whose participants have a cooperative relationship. The experimental results show that the improved sentence vector generation method scores about 16% higher than the TF-IDF weighted method, so the sentence vectors express the semantics of the text more accurately. The proposed similarity calculation method makes the similarity judgments more discriminative: compared with averaging the similarity of each content item, it improves results by about 15.8%. The similarity discrimination algorithm based on the cooperation network makes the detection process more targeted. When duplicated projects exist among related scholars, the checking time is shortened by 96% on average, which improves the efficiency of large-scale checking.
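The prioritization idea behind the cooperation network can be illustrated with a minimal sketch. It assumes that each project is represented only by a set of participant names and that "related" means direct co-participation; the function names, the data layout, and the sample names are hypothetical and are not taken from the paper's implementation.

```python
from collections import defaultdict


def build_cooperation_network(projects):
    """Map each scholar to the set of scholars they have shared a project with.

    projects: dict of project id -> set of participant names.
    """
    network = defaultdict(set)
    for members in projects.values():
        for scholar in members:
            network[scholar].update(m for m in members if m != scholar)
    return network


def prioritized_candidates(new_members, projects, network):
    """Order existing projects so those involving related scholars come first."""
    related = set(new_members)
    for scholar in new_members:
        related |= network.get(scholar, set())
    near, far = [], []
    for project_id, members in projects.items():
        # A project is "near" if it shares at least one related scholar.
        (near if members & related else far).append(project_id)
    return near + far


# Toy data: three completed projects and one new application by "Li".
completed = {"P1": {"Li", "Wang"}, "P2": {"Zhao"}, "P3": {"Wang", "Chen"}}
network = build_cooperation_network(completed)
order = prioritized_candidates({"Li"}, completed, network)
```

Here `order` places `P1` and `P3` ahead of `P2`, because both involve Li or Li's collaborator Wang. A checker that compares the new application against candidates in this order can stop early once a duplicate is found among related scholars, which is where the time saving would come from.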