Extractive Text Summarization Using a Set of Summarization Algorithms and the Sa-TRB Method

Article type: Research article

Authors

1 Engineering Department, University of Tabriz, Tabriz, Iran

2 Professor, Department of Computer Engineering, Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran

Abstract

Extractive text summarization is an essential technique in natural language processing that helps produce condensed versions of a text by extracting its most important sentences. In extractive summarization, sentences that contain useful and relevant information are selected for the final summary. Various algorithms exist for identifying these sentences, and the performance of each, as well as the summary it produces, varies with the type of text and the required summary size. This article presents a method named Sa-TRB, which is derived from the TextRank and BERT algorithms and, in addition to using these two methods, also draws on the sentences that the outputs of other algorithms have in common, in order to select the sentences of the final summary with high accuracy. The most important criterion for evaluating the algorithms is the quality of their final summary: the more closely the summary produced by an algorithm resembles a human-written summary, the better its quality. The ROUGE metrics are used to measure this similarity. Finally, experiments on the cnn-dailymail dataset with different summary sizes show that, as the required summary size increases, the proposed method achieves higher precision and score, and therefore a higher-quality final summary, despite a decrease in recall. In the last two experiments, where the compression rate is 20 and 25 percent, the score of the proposed method reaches 24.68 and 23.34 percent, which is roughly one percent better than the best of the other methods tested.
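
To make the evaluation criterion concrete, the following is a minimal sketch of a ROUGE-1 style comparison between a machine summary and a human summary. It assumes whitespace tokenization and lowercasing, and it assumes that the "score" in the abstract is an F-measure of precision and recall; the abstract does not state either point, and the function name `rouge_1` is hypothetical.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 between two texts.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped matches: a word counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

Precision is the share of the machine summary that also appears in the human summary, and recall is the share of the human summary that the machine summary recovers, so the two can move in opposite directions as the summary length changes.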

Article Title [English]

Extractive Automatic Text Summarization using integrated set of algorithms and Sa-TRB method

Authors [English]

  • Abolfazl Sadrolsadati 1
  • Mohammad-Reza Feizi-Derakhshi 2
1 Department of Computer Engineering, Faculty of Electrical & Computer Engineering, University of Tabriz, Tabriz, Iran
2 Department of Computer Engineering, Faculty of Electrical & Computer Engineering, University of Tabriz, Tabriz, Iran
Abstract [English]

Extractive text summarization is an essential technique in natural language processing that helps produce compact versions of a text by extracting its most important sentences. Because shortening and summarizing a text document by hand is time-consuming and exhausting, an automatic system for creating these short versions seems necessary. In extractive summarization, sentences that contain useful and relevant information are selected for the final summary. Different algorithms exist for identifying these sentences, and the performance of each, as well as the summary it creates, varies with the type and scope of the text and the size of the required summary. This article presents a method called Sa-TRB, which is derived from two algorithms, TextRank and BERT, and which, in addition to using these two methods, also uses the sentences that the outputs of other algorithms have in common to achieve high accuracy in selecting the final summary sentences. The most important criterion for evaluating the performance of the algorithms is the quality of their final summary: the more similar the summary created by an algorithm is to a summary created by humans, the better its quality. ROUGE metrics are used to measure this similarity. Finally, experiments on the cnn-dailymail dataset with different summary sizes show that, as the required summary size increases, the proposed method achieves higher precision and score and, as a result, a higher-quality final summary, despite a decrease in recall. In the last two tests, the score of the proposed method reaches 24.68% and 23.34%, which is almost one percent better than the best of the other tested methods.
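
The idea of drawing on sentences that several algorithms agree on can be illustrated with a simple majority vote. The abstract does not specify how Sa-TRB actually combines TextRank, BERT, and the other algorithms, so the sketch below is not the authors' method: the function `consensus_summary`, the voting rule, the tie-breaking rule, and the reading of "compression rate" as the fraction of sentences kept are all assumptions made for illustration.

```python
from collections import Counter


def consensus_summary(sentences, rankings, compression_rate):
    """Keep the sentences that the most rankers place in their top-k.

    sentences: the document's sentences, in document order.
    rankings: one list of sentence indices per algorithm, best first
              (hypothetical outputs of TextRank, a BERT-based ranker, etc.).
    compression_rate: assumed fraction of sentences to keep, e.g. 0.2.
    """
    k = max(1, round(len(sentences) * compression_rate))
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:k])  # each algorithm votes for its own top-k
    # Most votes first; ties go to the sentence that occurs earlier in the text.
    chosen = sorted(votes, key=lambda i: (-votes[i], i))[:k]
    return [sentences[i] for i in sorted(chosen)]  # restore document order


doc = ["s0", "s1", "s2", "s3", "s4"]
ranked = [[0, 2, 4, 1, 3], [2, 0, 1, 3, 4], [2, 3, 0, 1, 4]]
print(consensus_summary(doc, ranked, 0.4))
```

Voting over each algorithm's top-k favors sentences that independent rankers agree on, which is one plausible way to trade some recall for higher precision.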

Keywords [English]

  • TextRank
  • BERT
  • LSA
  • Sa-TRB
  • ROUGE