What information is hidden in online news articles?

Data analysis of online news

Andreas Stöckl
Towards Data Science

Length and volume of online news per weekday and time / Image by Author

The news published online by daily newspapers is an important source of information. It contains not only the statements the publisher wants to disseminate, but also, implicitly, other information about the publisher and its employees. This flow of information is usually not intended, and the publishers are not even aware of it.

These are not secret messages hidden in individual articles, of the kind some people believe they find in Beatles songs, but information that only becomes apparent when a large amount of data is viewed together and combined correctly. In this article, I would like to show this with some examples.

The data suggest, for example, that editors of the daily newspaper “Der Standard” like to sleep longer on weekends, and that they write longer articles in the morning and on weekends. The rest of the day seems to be dominated by agency reports.

The newspaper “Kronen Zeitung” in particular, but also the portal “oe24.at”, publish hardly any longer articles, which is not unexpected. What is a bit surprising is the size of the gap: articles in the “Standard” are on average about ten times as long as those in the “Kronen Zeitung”.

From the publications of editors who are identified by name, information can be gained about their vacation behavior, or about which other editors they might be close to.

The data

As data for the examples I have selected the news articles of the online edition of three Austrian daily newspapers:
- “Österreich” — www.oe24.at
- “Kronen Zeitung” — www.krone.at
- “Der Standard” — www.derstandard.at
The first two can be classified as tabloids (boulevard press); the latter is considered a quality newspaper.

Over a period of three months (11 August 2020 to 9 November 2020), I collected the texts of the news articles together with some metadata such as publication date and author. This resulted in a data volume of:
- 10,933 articles from derstandard.at
- 12,990 articles from krone.at
- 29,868 articles from oe24.at

The first overview of the data

To get a first overview of the data, we will look at the number of articles published by the different newspapers each day.

All three newspapers show a weekly cycle, which is most pronounced in the “Standard”: on weekends and holidays, much less is published. On average, just over 100 articles are published per day. On oe24.at, the number has been significantly higher since the beginning of October.

Articles on derstandard.at per day / Image by Author
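
A daily count like the one in this chart can be sketched in a few lines. The following is a minimal illustration, not the code used for the article: the timestamps are invented sample data, and the names `published`, `per_day` and `per_weekday` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Invented sample data: publication timestamps of a few articles (ISO format).
published = [
    "2020-08-14T08:15:00",  # Friday
    "2020-08-14T11:40:00",
    "2020-08-14T16:05:00",
    "2020-08-15T10:30:00",  # Saturday
    "2020-08-17T07:55:00",  # Monday
    "2020-08-17T09:20:00",
]

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

timestamps = [datetime.fromisoformat(p) for p in published]

# Number of articles per calendar day (the quantity plotted per day).
per_day = Counter(t.date() for t in timestamps)

# Number of articles per weekday, which exposes a weekly cycle.
per_weekday = Counter(WEEKDAYS[t.weekday()] for t in timestamps)

for day in sorted(per_day):
    print(day, per_day[day])
```

With real data, the weekday totals would be divided by the number of weeks observed to obtain comparable averages.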

When are articles written and how long are they?

Now let’s look at the time of day and the day of the week when the news was published. The size of the dots describes the number of articles. The color indicates the average text length (number of words) per time slot: blue circles stand for short articles, and the darker the red tone, the longer the articles.

Length and volume of online news per weekday and time — derstandard.at / Image by Author
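
The aggregation behind such a chart could look roughly like the sketch below: articles are grouped by weekday and hour, and for each group the count (dot size) and the mean number of words (dot color) are computed. The sample records and all names are assumptions made for illustration, and splitting on whitespace is only one possible way to count words.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Invented sample records: (publication timestamp, article text).
articles = [
    ("2020-08-17T07:10:00", "long morning piece " * 200),   # Monday, 600 words
    ("2020-08-17T07:45:00", "another long read " * 100),    # Monday, 300 words
    ("2020-08-17T15:20:00", "short agency report"),         # Monday, 3 words
    ("2020-08-22T10:05:00", "weekend feature " * 200),      # Saturday, 400 words
]

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Collect the word counts of all articles that fall into the same (weekday, hour) slot.
lengths = defaultdict(list)
for ts, text in articles:
    t = datetime.fromisoformat(ts)
    lengths[(WEEKDAYS[t.weekday()], t.hour)].append(len(text.split()))

# One entry per dot in the chart: dot size = count, dot color = mean word count.
cells = {
    slot: {"count": len(words), "mean_words": mean(words)}
    for slot, words in lengths.items()
}

for slot, stats in sorted(cells.items()):
    print(slot, stats)
```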

Most articles are published during the day, and on weekends publication starts a little later. Here the editors seem to want to sleep a little longer. In general, less is published on weekends.

From the coloring, one can see that the editorial staff of the “Standard” evidently takes time to write long articles every morning and on weekends, as expected from a “quality medium”.
Are mainly short agency reports distributed later in the day?

Length and volume of online news per weekday and time — oe24.at / Image by Author

The next chart shows a slightly different picture for the editorial office of “Österreich” (oe24.at). Here there are only a small number of longer articles, at 6 o’clock in the morning, shortly before midnight, and at Friday noon.
What kind of articles are these?
Here too, more is published on weekdays.

Length and volume of online news per weekday and time — krone.at / Image by Author

Which topics are published?

I want to know not only when articles are published and how long they are, but also which topics they cover. For this purpose, I subjected the individual articles to an automatic topic assignment.

The analysis was done with the “News Intelligence Platform” from “Aylien”, using the “IAB” categorization. This categorization was developed in order to match online advertisements with the right content.
In my example, only the main categories were used.

Now let’s look at how many articles were published in the most important categories by the three news producers. The size of the circles reflects the total number in the period under consideration.

Topics on derstandard.at / Image by Author
Topics on oe24.at / Image by Author
Topics on krone.at / Image by Author

Next, let’s look at whether there are differences in when articles on the different topics are published. The following chart shows, for the “Standard”, the distribution of publications over the day for the two most common categories, “Politics” and “Sports”.

Articles per hour and topic for derstandard.at / Image by Author

The number of articles for the sports sector increases much more slowly in the morning than for politics. Is this because there is not too much to report about sports in the morning, or because sports editors like to sleep in?

The data for the articles of “Krone” and “Oe24.at” show the same picture.
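
A comparison of this kind boils down to counting articles per topic and hour. The sketch below shows one way to do this; the records are invented, and `hourly_profile` and `first_active_hour` are hypothetical helper names, not part of any tool mentioned in this article.

```python
from collections import Counter

# Invented sample data: (main category, publication hour) for a few articles.
records = [
    ("Politics", 6), ("Politics", 7), ("Politics", 7), ("Politics", 8),
    ("Sports", 8), ("Sports", 10), ("Sports", 10), ("Politics", 10),
]

# Number of articles for each (topic, hour) combination.
per_topic_hour = Counter(records)


def hourly_profile(topic):
    """Return a list of 24 counts, one per hour of the day, for the given topic."""
    return [per_topic_hour[(topic, hour)] for hour in range(24)]


def first_active_hour(topic):
    """Return the earliest hour with at least one article for the given topic."""
    return next(hour for hour, n in enumerate(hourly_profile(topic)) if n > 0)


for topic in ("Politics", "Sports"):
    print(topic, "starts at hour", first_active_hour(topic), hourly_profile(topic)[6:12])
```

Plotting the two 24-value profiles against each other gives a chart of the kind shown above.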

If you also assign the individual articles to a “topic map”, you get a picture of how the different articles are distributed. The following graphic shows the distribution of the articles and the most important categories for the “Standard”.

Map of articles and topics for derstandard.at / Image by Author

Nearby points stand for articles that are similar, and the colors reflect the topics.
For this illustration, a “Sentence Embedding” was calculated for the title of each article, which encodes the meaning of the title.

Subsequently, a dimensionally reduced 2D plot was generated using the “t-SNE” method.
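
The dimensionality reduction step can be sketched as follows, assuming NumPy and scikit-learn are installed. Note the assumptions: the embedding model used for the article is not named here, so the sketch uses random vectors clustered around three centers as a stand-in for real title embeddings, and the perplexity value is an arbitrary choice for this small sample.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for real sentence embeddings (an assumption for this sketch):
# 3 hypothetical topics, 20 titles each, 32-dimensional vectors.
centers = rng.normal(size=(3, 32)) * 5.0
embeddings = np.vstack([center + rng.normal(size=(20, 32)) for center in centers])
topics = np.repeat(["Politics", "Sports", "News"], 20)

# Reduce the 32-dimensional vectors to 2D coordinates for plotting.
# Perplexity must be smaller than the number of samples (60 here).
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
coords = tsne.fit_transform(embeddings)

# Each row of coords is the (x, y) position of one title on the topic map.
for topic, (x, y) in list(zip(topics, coords))[:3]:
    print(topic, round(float(x), 2), round(float(y), 2))
```

In a scatter plot of `coords` colored by `topics`, titles with similar embeddings end up close to each other, which is what the map above shows for the real articles.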

What about the length of the articles on the topics in the different media?

The average text length for the “Kronen Zeitung” hardly varies from topic to topic and is also significantly shorter than for the “Standard”. At “Oe24.at” the picture is similar, except that the articles are generally a bit longer, and there are longer articles in the “Automotive” section.

Text length for krone.at / Image by Author

The average text length for the “Standard” is ten times longer than for the “Kronen Zeitung”. The data confirm the prejudice that tabloids offer hardly any text. The “Standard” also shows significant differences in length between the subject areas. “News” articles, for example, are significantly shorter than the rest.

Text length for derstandard.at / Image by Author
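
The comparison of text lengths amounts to averaging word counts per source and per topic. The following sketch shows the idea; the word counts are invented toy values and do not reproduce the figures measured for this article.

```python
from collections import defaultdict
from statistics import mean

# Invented toy data: (source, main category, number of words).
rows = [
    ("derstandard.at", "Politics", 620),
    ("derstandard.at", "Politics", 580),
    ("derstandard.at", "News", 250),
    ("krone.at", "Politics", 55),
    ("krone.at", "Sports", 65),
]

by_source_topic = defaultdict(list)
by_source = defaultdict(list)
for source, topic, words in rows:
    by_source_topic[(source, topic)].append(words)
    by_source[source].append(words)

# Mean text length per (source, topic) and per source.
mean_by_source_topic = {key: mean(v) for key, v in by_source_topic.items()}
mean_by_source = {key: mean(v) for key, v in by_source.items()}

# How many times longer the average article of one source is than that of another.
ratio = mean_by_source["derstandard.at"] / mean_by_source["krone.at"]

for key, value in sorted(mean_by_source_topic.items()):
    print(key, value)
print("length ratio:", round(ratio, 1))
```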

Can we say something about individual persons or parts of the editorial staff?

Some newspapers mark their articles with the names of the editors or of editorial departments. In the case of the “Kronen Zeitung”, for example, the articles can be assigned to the editorial offices of the individual federal states. This makes it clear how active the individual state offices are.

Articles per state / Image by Author

It can be seen that over the entire period considered, the Viennese central editorial office is the most active, and provinces such as Burgenland and Vorarlberg contribute only a little.

If one evaluates articles that show the name of the author, statements about individual persons can be made. In the following figure, the names have therefore been made unrecognizable.

Articles per author and day / Image by Author

Some interesting information can be taken from this evaluation. Permanently employed editors and freelancers can easily be told apart by the number of their contributions. The color code can be used to assign people to thematic areas, and gaps in the publications can indicate vacations. In this way, indications of shared vacations in the editorial department can also be collected from such graphics. Do these point to a closer personal relationship between the persons?

Questions and information such as the latter show the danger of such evaluations. Only publicly and freely available data was used. The sensitivity arises only from aggregating and linking many data points, together with a suitable visualization. Humans can then draw conclusions on this basis thanks to their strong pattern recognition abilities.

For example, a competitor can identify the subject areas of editors in order to recruit them. A supervisor can spy on the private connections of employees, and much more.

University of Applied Sciences Upper Austria / School of Informatics, Communications and Media http://www.stoeckl.ai/profil/