I visited the Natural Language Processing (NLP) group of the University of Sheffield from 11.11.2018 to 5.12.2018 for a short-term scientific mission. The research visit was funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 654024 (SoBigData project).

The aim of this research visit was to develop a model and an analysis pipeline for the viewpoint-based summarization of controversial topics. This project is a continuation of the paper "Viewpoint Discovery and Understanding in Social Networks" (Web Science 2018) [1].

During the visit, I familiarized myself with several NLP functions of GATE (a framework developed and maintained by the NLP group of the University of Sheffield) and then developed a pipeline for analyzing social media data related to a controversial topic (such as Brexit). The pipeline comprises the following analysis steps:

  (i) Gathering of tweets and retweets related to a controversial topic using topic-related hashtags and/or user accounts.
  (ii) Viewpoint discovery: clustering of the users based on their viewpoint on the topic using a multi-level graph partitioning method on the retweet graph (as proposed in [1]).
  (iii) Extraction of top URLs and domains per viewpoint. The URLs and domains are ranked using a simple method that considers the number of retweets and distinct users.
  (iv) Extraction of sentences and named entities from the text of the top-K URLs of each viewpoint. The GATE framework is used for named entity recognition and sentence extraction, while boilerplate removal is performed using the boilerpipe tool [2].
  (v) Extraction of entity pairs and graph construction for each viewpoint, where an entity pair is created when two entities co-occur in the same sentence.
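The URL ranking step (step (iii)) is described only as a simple method that considers the number of retweets and distinct users. The sketch below shows one way such a ranking could look. The scoring formula (retweet count multiplied by the number of distinct users), the function names `rank_urls` and `domain_of`, and the sample URLs are assumptions made for illustration, not the method actually used in the pipeline.

```python
from collections import defaultdict
from urllib.parse import urlparse


def rank_urls(retweets, top_k=10):
    """Rank the URLs shared within one viewpoint.

    `retweets` is an iterable of (user_id, url) pairs, one per retweet.
    ASSUMPTION: score = retweet count * number of distinct users.
    """
    retweet_count = defaultdict(int)
    users = defaultdict(set)
    for user_id, url in retweets:
        retweet_count[url] += 1
        users[url].add(user_id)
    scored = [(url, retweet_count[url] * len(users[url]))
              for url in retweet_count]
    # Highest score first; ties are broken alphabetically by URL.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


def domain_of(url):
    """Return the host part of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


# Hypothetical sample data: (user, shared URL) for each retweet.
sample = [
    ("u1", "https://www.example.org/a"),
    ("u2", "https://www.example.org/a"),
    ("u1", "https://example.com/b"),
    ("u1", "https://example.com/b"),
    ("u1", "https://example.com/b"),
]
for url, score in rank_urls(sample):
    print(domain_of(url), url, score)
```

Multiplying the two counts favours URLs that are both frequently retweeted and spread across many users, so a URL retweeted repeatedly by a single account ranks lower than its raw retweet count alone would suggest.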

Steps (i) and (ii) are executed once, resulting in a set of tweets and retweets related to the topic and a clustering of the users based on their viewpoint on the topic.
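As a rough illustration of the input to viewpoint discovery, the following sketch builds a retweet graph from the gathered data. It assumes an undirected graph whose edge weights count the retweets between two users; this representation and the name `build_retweet_graph` are assumptions, and the multi-level partitioning method of [1] is not reproduced here.

```python
from collections import Counter


def build_retweet_graph(retweets):
    """Build an undirected, weighted retweet graph.

    `retweets` is an iterable of (retweeting_user, original_author) pairs.
    ASSUMPTION: the edge weight is the number of retweets between the two
    users, counted in either direction.
    """
    edges = Counter()
    for retweeter, author in retweets:
        if retweeter == author:
            continue  # skip self-retweets, which carry no relation
        edges[frozenset((retweeter, author))] += 1
    return edges


# Hypothetical sample: user "a" and user "b" retweet each other once.
graph = build_retweet_graph([("a", "b"), ("b", "a"), ("c", "a"), ("a", "a")])
for edge, weight in graph.items():
    print(sorted(edge), weight)
```

The resulting edge weights could then be passed to a graph partitioning tool to obtain the user clusters.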

Steps (iii)-(v) are executed for a given time period (e.g., a specific day or week), which means that, in the long run, a time series of graphs is formed for each viewpoint, where each point in the time series represents the viewpoint during a certain period. This graph-based timeline representation allows inspecting how the viewpoints evolve over time, as well as with respect to the involved entities.
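The following minimal sketch illustrates how sentence-level co-occurrence could yield one entity graph per time period for a single viewpoint. It assumes that named entity recognition has already produced, for each sentence, a publication date and a list of entity names; the function name `entity_pair_timeline`, the `period` parameter, and the placeholder entities are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date
from itertools import combinations


def entity_pair_timeline(sentences, period=lambda d: d.isoformat()):
    """Count co-occurring entity pairs per time period.

    `sentences` is an iterable of (date, entity_names) tuples, one per
    sentence. `period` maps a date to a period key (by default, the day).
    Returns {period_key: Counter({(entity_a, entity_b): count})}.
    """
    timeline = defaultdict(Counter)
    for day, entities in sentences:
        # Deduplicate and sort so each pair has one canonical order.
        unique = sorted(set(entities))
        for pair in combinations(unique, 2):
            timeline[period(day)][pair] += 1
    return dict(timeline)


# Placeholder input: entity names are illustrative, not real results.
sentences = [
    (date(2016, 6, 22), ["A", "B", "C"]),
    (date(2016, 6, 22), ["B", "A"]),
    (date(2016, 6, 23), ["C", "D"]),
]
timeline = entity_pair_timeline(sentences)
for key in sorted(timeline):
    print(key, timeline[key].most_common(3))
```

Each counter can be read as a weighted graph whose nodes are entities and whose edge weights are co-occurrence counts. Passing a different `period` function, for example one that returns the ISO year and week of the date, would produce weekly instead of daily graphs.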

The above pipeline was tested in two case studies related to two popular controversial topics:

  • 2016 Brexit referendum
  • 2016 US Election

The following table provides the main statistics of the gathered data (step (i) of the pipeline):

| Topic | Number of tweets | Number of users | Average number of tweets per day (in thousands) | Date range |
|---|---|---|---|---|
| Brexit | 1,144,615 | 263,334 | 3.4 | 05.2016 – 04.2017 |
| US Election | 2,378,410 | 591,023 | 16.3 | 10.2016 – 02.2017 |

With respect to step (ii) of the pipeline, the viewpoint discovery method extracted three viewpoints for the Brexit topic (pro-Brexit, against-Brexit, neutral) and two viewpoints for the 2016 US Election (pro-Trump, against-Trump). More information about the extracted viewpoints and the methodology is available in [1].

With respect to step (iii) of the pipeline, the case studies showed that, as expected, the top URLs of each viewpoint differ and belong to different domains. For example, regarding the Brexit topic, the domain “express.co.uk” is popular for the pro-Brexit viewpoint and the domain “theguardian.com” for the against-Brexit viewpoint. As regards the 2016 US Election topic, the domains “huffingtonpost.com” and “politicususa.com” are popular for the against-Trump viewpoint, while “breitbart.com” and “foxnews.com” are popular for the pro-Trump viewpoint.

Regarding the last two steps of the pipeline (extraction of sentences and entity pairs), the case studies showed that, for a given time period, different entity pairs are important for each viewpoint. Even for the same entity pair, the context (surrounding entities and related text) is quite different across the viewpoints. With respect to Brexit, for example, on the day before the referendum (June 22, 2016) the entity pair [John Parker, John McFarlane] is one of the top pairs for the against-Brexit viewpoint (representative sentence: “Many business leaders who signed the Times letter have already expressed support, but Remain said new names include Sir John Parker from Anglo American, and Barclays' John McFarlane”). In contrast, for the pro-Brexit viewpoint, the entity pair [Markus Kerber, BDI] seems to be very important (representative sentence: “Markus Kerber, the head of the influential BDI which represents German industry, said his organisation would make the case against such measures.”). Regarding the US Election topic, the entity pair [Alessia, Donald Trump] was important for the against-Trump viewpoint on the day before the election (Nov 7, 2016) (representative sentence: “I voted against Donald Trump because of 8-year-old Alessia.”), while [Bill Clinton, Obama] was one of the top pairs for the pro-Trump viewpoint (representative sentence: “Two protesters chanting ‘Bill Clinton is a rapist’ crashed an Obama speech while blowing on ‘rape whistles’ before being escorted out by security.”).

Future work is mainly concerned with the analysis of the derived temporal and viewpoint-specific entity graphs. The focus will be on extracting a top-K subgraph for a given viewpoint and time period (where K is the number of top nodes to show). This graph can then be used to formally represent a viewpoint, which in turn can help measure the neutrality or bias of documents, search results, news providers, etc. Moreover, some steps of the analysis pipeline can be further improved. For example, some URLs are no longer accessible or correspond to “home” pages, whose content has changed since the tweets were posted. A solution to this problem is to use the Internet Archive’s Wayback Machine, which provides access to old versions of billions of web pages.
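The top-K subgraph extraction is left as future work, so no method is fixed yet. As one possible starting point, the sketch below ranks entities by weighted degree (the sum of the co-occurrence counts of their edges) and keeps only the edges among the K highest-ranked entities. The ranking criterion, the function name `top_k_subgraph`, and the placeholder data are assumptions.

```python
from collections import Counter


def top_k_subgraph(pair_counts, k):
    """Keep the k strongest nodes and the edges among them.

    `pair_counts` maps (entity_a, entity_b) to a co-occurrence count.
    ASSUMPTION: node importance is the weighted degree, i.e. the sum of
    the counts of all pairs the entity takes part in.
    """
    strength = Counter()
    for (a, b), weight in pair_counts.items():
        strength[a] += weight
        strength[b] += weight
    # Strongest nodes first; ties are broken alphabetically.
    ranked = sorted(strength, key=lambda node: (-strength[node], node))
    top_nodes = set(ranked[:k])
    edges = {pair: weight for pair, weight in pair_counts.items()
             if pair[0] in top_nodes and pair[1] in top_nodes}
    return top_nodes, edges


# Placeholder graph for one viewpoint and one time period.
pairs = {("A", "B"): 3, ("A", "C"): 2, ("C", "D"): 1}
nodes, edges = top_k_subgraph(pairs, k=2)
print(sorted(nodes), edges)
```

Other centrality measures could be substituted for the weighted degree without changing the rest of the function.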


[1] Quraishi, M., Fafalios, P., & Herder, E. (2018, May). Viewpoint Discovery and Understanding in Social Networks. In Proceedings of the 10th ACM Conference on Web Science (pp. 47-56). ACM.

[2] Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010, February). Boilerplate Detection Using Shallow Text Features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (pp. 441-450). ACM.