experiment.pl

The main script is experiment.pl – a Perl script. It progresses in 5 steps:
  1. Find frequent classes
  2. Collect triples per entity and per class
  3. Find linguistic patterns per class
  4. Collect frequent linguistic patterns for each frequent class
  5. Create the rules
All relevant parameters are set via the hash $CFG. The script can be run without command-line parameters via perl experiment.pl .

The script requires a few Perl modules to be installed: YAML::Syck, IO::Uncompress::Bunzip2, URL::Encode, Number::Bytes::Human, and Text::CSV.
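
Before the first run, it can help to check that these modules are loadable. The following is a minimal sketch, not part of experiment.pl; the helper name missing_modules is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Modules listed above as requirements of experiment.pl.
my @required = qw(
    YAML::Syck
    IO::Uncompress::Bunzip2
    URL::Encode
    Number::Bytes::Human
    Text::CSV
);

# Hypothetical helper (not part of experiment.pl):
# returns the names of the modules that cannot be loaded.
sub missing_modules {
    my @modules = @_;
    my @missing;
    for my $module (@modules) {
        # Turn Foo::Bar into Foo/Bar.pm so that require() gets a file name.
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules(@required);
if (@missing) {
    print "Missing modules: @missing\n";
    print "They can be installed, for example, with: cpan @missing\n";
} else {
    print "All required modules are available.\n";
}
```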

Configuration

We explain what the script does (and which intermediate files it creates) for the following configuration:
        min_entities_per_class          => 100,
        max_entities_per_class          => 10000,
        min_onegram_length              => 4,
        min_pattern_count               => 5,

        min_anchor_count                => 10,
        min_propertyonegram_length      => 4,
        min_propertypattern_count       => 5,
        min_propertystring_length       => 5,
        max_propertystring_length       => 50,

        min_supA                        => 5,
        min_supB                        => 5,
        min_supAB                       => 5,

        rulepattern => {
                predict_l_for_s_given_po                => 0,
                predict_po_for_s_given_l                => 0,
                predict_localized_l_for_s_given_po      => 0,
                predict_po_for_s_given_localized_l      => 0,

                predict_l_for_s_given_p                 => 0,
                predict_p_for_s_given_l                 => 0,
                predict_localized_l_for_s_given_p       => 0,
                predict_p_for_s_given_localized_l       => 0,

                predict_l_for_s_given_o                 => 0,
                predict_o_for_s_given_l                 => 0,

                predict_l_for_o_given_sp                => 0,
                predict_sp_for_o_given_l                => 0,
                predict_localized_l_for_o_given_sp      => 0,
                predict_sp_for_o_given_localized_l      => 0,

                predict_l_for_o_given_s                 => 0,
                predict_s_for_o_given_l                 => 0,

                predict_l_for_o_given_p                 => 1,
                predict_p_for_o_given_l                 => 1,
                predict_localized_l_for_o_given_p       => 1,
                predict_p_for_o_given_localized_l       => 1,
        },
According to this configuration, only classes that have at least 100 instances are considered ("min_entities_per_class => 100"). If a class has more than 10,000 instances, exactly 10,000 of them are randomly selected ("max_entities_per_class => 10000").
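
The effect of these two parameters can be sketched as follows. This is an illustration under assumptions, not the code of experiment.pl: the helper name sample_class_entities and the example IRIs are made up, and the actual script may draw its random sample differently.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: returns undef for a class that is not frequent,
# otherwise an array reference with at most $max randomly chosen entities.
sub sample_class_entities {
    my ($entities, $min, $max) = @_;
    return if @$entities < $min;                  # class is not frequent
    return [ @$entities ] if @$entities <= $max;  # small enough: keep all
    my @shuffled = shuffle(@$entities);
    return [ @shuffled[0 .. $max - 1] ];          # random sample of size $max
}

my %CFG = (
    min_entities_per_class => 100,
    max_entities_per_class => 10000,
);

# Made-up entity IRIs for a class with 12000 instances.
my @entities = map { "http://example.org/resource/E$_" } 1 .. 12000;
my $sample   = sample_class_entities(
    \@entities,
    $CFG{min_entities_per_class},
    $CFG{max_entities_per_class},
);
print scalar(@$sample), "\n";   # prints 10000
```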

When n-grams are extracted from a document, 1-grams that consist of fewer than 4 characters are ignored ("min_onegram_length => 4"). N-grams that occur fewer than 5 times in documents about entities of a class are ignored ("min_pattern_count => 5").
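
The two filters can be sketched as follows. Several details are assumptions made for this illustration and are not taken from experiment.pl: the tokenization (lower-cased runs of word characters), the maximum n-gram size $max_n, counting every occurrence rather than every document, and the helper names extract_ngrams and frequent_patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: extract all 1- to $max_n-grams from a text.
# 1-grams shorter than $min_onegram_length are skipped.
sub extract_ngrams {
    my ($text, $min_onegram_length, $max_n) = @_;
    my @tokens = lc($text) =~ /(\w+)/g;   # assumed tokenization
    my @ngrams;
    for my $n (1 .. $max_n) {
        for my $i (0 .. $#tokens - $n + 1) {
            my $ngram = join ' ', @tokens[$i .. $i + $n - 1];
            next if $n == 1 && length($ngram) < $min_onegram_length;
            push @ngrams, $ngram;
        }
    }
    return @ngrams;
}

# Hypothetical helper: count n-grams over all texts of one class and
# keep those that occur at least $min_pattern_count times.
sub frequent_patterns {
    my ($texts, $min_onegram_length, $max_n, $min_pattern_count) = @_;
    my %count;
    for my $text (@$texts) {
        $count{$_}++ for extract_ngrams($text, $min_onegram_length, $max_n);
    }
    my %frequent;
    for my $ngram (keys %count) {
        $frequent{$ngram} = $count{$ngram}
            if $count{$ngram} >= $min_pattern_count;
    }
    return \%frequent;
}

# Made-up abstracts for one class.
my @abstracts = (
    ("He is a German politician.") x 5,
    "She is a French politician.",
);
my $frequent = frequent_patterns(\@abstracts, 4, 2, 5);
print "$_\t$frequent->{$_}\n" for sort keys %$frequent;
```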

Before localized patterns can be extracted, the arguments of a relation need to be identified. If an argument is an entity (rather than a literal), this entity can be identified via its rdfs:label or via its anchor text. We only consider anchor texts that refer to an entity at least 10 times in Wikipedia ("min_anchor_count => 10").

Once we have identified the arguments of a relation, we extract the string between the mentions of the arguments. We ignore strings that consist of fewer than 5 or more than 50 characters ("min_propertystring_length => 5", "max_propertystring_length => 50"). We ignore n-grams extracted from this string that consist of fewer than 4 characters ("min_propertyonegram_length => 4"). N-grams that occur fewer than 5 times in documents about entities of a class are ignored ("min_propertypattern_count => 5").
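
Extracting the string between two mentions, together with the length filter, can be sketched as follows. This is a simplified illustration, not the code of experiment.pl: it assumes plain substring matching, considers only the first mention of the first argument followed by the second argument, and trims surrounding whitespace before checking the length. The helper name string_between_mentions is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the string between the first mention of
# $arg1 and the next mention of $arg2, or undef if there is no such
# string or if its length is outside [$min_len, $max_len].
sub string_between_mentions {
    my ($text, $arg1, $arg2, $min_len, $max_len) = @_;
    my $start = index($text, $arg1);
    return if $start < 0;                      # first argument not mentioned
    my $from = $start + length($arg1);
    my $end  = index($text, $arg2, $from);
    return if $end < 0;                        # second argument not mentioned
    my $between = substr($text, $from, $end - $from);
    $between =~ s/^\s+|\s+$//g;                # assumed: trim whitespace
    my $len = length($between);
    return if $len < $min_len || $len > $max_len;
    return $between;
}

my %CFG = (
    min_propertystring_length => 5,
    max_propertystring_length => 50,
);

my $string = string_between_mentions(
    "Berlin is the capital of Germany.",
    "Berlin", "Germany",
    $CFG{min_propertystring_length},
    $CFG{max_propertystring_length},
);
print "$string\n";   # prints "is the capital of"
```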

In step 5, where the rules are learned, a rule is discarded if the support of the left-hand side of the association rule is below 5 ("min_supA => 5"), if the support of the right-hand side is below 5 ("min_supB => 5"), or if the support of the joint event (AB) is below 5 ("min_supAB => 5").
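
The three support thresholds can be sketched as follows. The sketch assumes that support is an absolute count of entities for which the left-hand side (A), the right-hand side (B), or both hold; the data layout and the helper names rule_supports and passes_support_thresholds are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count for how many entities the left-hand side,
# the right-hand side, and both sides of a rule hold.
# $items_per_entity maps an entity to a hash of the items observed for it.
sub rule_supports {
    my ($items_per_entity, $lhs, $rhs) = @_;
    my ($supA, $supB, $supAB) = (0, 0, 0);
    for my $items (values %$items_per_entity) {
        my $has_lhs = exists $items->{$lhs};
        my $has_rhs = exists $items->{$rhs};
        $supA++  if $has_lhs;
        $supB++  if $has_rhs;
        $supAB++ if $has_lhs && $has_rhs;
    }
    return ($supA, $supB, $supAB);
}

# Hypothetical helper: returns 1 if a rule reaches all three thresholds.
sub passes_support_thresholds {
    my ($cfg, $supA, $supB, $supAB) = @_;
    return (   $supA  >= $cfg->{min_supA}
            && $supB  >= $cfg->{min_supB}
            && $supAB >= $cfg->{min_supAB}) ? 1 : 0;
}

my $CFG = { min_supA => 5, min_supB => 5, min_supAB => 5 };

# Made-up data: 6 entities with both items, 2 with only the pattern,
# 1 with only the property.
my %items_per_entity;
$items_per_entity{"both$_"} = { 'l:born in' => 1, 'p:birthPlace' => 1 } for 1 .. 6;
$items_per_entity{"lhs$_"}  = { 'l:born in' => 1 } for 1 .. 2;
$items_per_entity{"rhs1"}   = { 'p:birthPlace' => 1 };

my @supports = rule_supports(\%items_per_entity, 'l:born in', 'p:birthPlace');
print "supA=$supports[0] supB=$supports[1] supAB=$supports[2]\n";
print passes_support_thresholds($CFG, @supports) ? "keep\n" : "discard\n";
```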

One can configure which rule patterns are mined. According to the configuration above, no rules for the rule pattern predict_l_for_s_given_po are extracted ("predict_l_for_s_given_po => 0"), whereas rules for the rule pattern predict_l_for_o_given_p are extracted ("predict_l_for_o_given_p => 1"). Note that once the first four steps of the algorithm are completed, one can configure that rules are extracted for only some of the rule patterns and start the script. While it is running, one can configure another set of rule patterns and start the script again. Mining is then carried out for both sets (or a larger number of sets) of rule patterns in parallel.
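
How the rulepattern switches select what is mined can be sketched as follows. The hash below is only a subset of the configuration shown above, and the helper name enabled_rule_patterns is made up; how experiment.pl reads these switches internally is not shown here.

```perl
use strict;
use warnings;

# Subset of the configuration shown above.
my $CFG = {
    rulepattern => {
        predict_l_for_s_given_po          => 0,
        predict_po_for_s_given_l          => 0,
        predict_l_for_o_given_p           => 1,
        predict_p_for_o_given_l           => 1,
        predict_localized_l_for_o_given_p => 1,
        predict_p_for_o_given_localized_l => 1,
    },
};

# Hypothetical helper: returns the names of all rule patterns that are
# switched on, in alphabetical order.
sub enabled_rule_patterns {
    my ($cfg) = @_;
    my @enabled = grep { $cfg->{rulepattern}{$_} }
                  sort keys %{ $cfg->{rulepattern} };
    return @enabled;
}

my @enabled = enabled_rule_patterns($CFG);
print "$_\n" for @enabled;
```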

experiment.pl - Step 1

In Step 1 the DBpedia files short-abstracts_lang=en.ttl.bz2 and instance-types_lang=en_specific.ttl.bz2 are analyzed and the following files are created.
The file entities_with_abstract.yml contains a list of all entities for which a dbo:abstract exists. Excerpt:
--- 
http://dbpedia.org/resource/!!!: 1
http://dbpedia.org/resource/!!!_(album): 1
http://dbpedia.org/resource/!Action_Pact!: 1
http://dbpedia.org/resource/!Arriba!_La_Pachanga: 1
http://dbpedia.org/resource/!Hero: 1
http://dbpedia.org/resource/!Hero_(album): 1
http://dbpedia.org/resource/!Oka_Tokat: 1
http://dbpedia.org/resource/!PAUS3: 1
http://dbpedia.org/resource/!T.O.O.H.!: 1
The file frequent_class_to_entities-100-10000.yml contains a list of frequent classes and for each frequent class a list of its sampled instances. Excerpt:
---
http://dbpedia.org/ontology/AcademicJournal:
  http://dbpedia.org/resource/100_Word_Story: 1
  http://dbpedia.org/resource/19th-Century_Music: 1
  http://dbpedia.org/resource/306090: 1
  http://dbpedia.org/resource/4OR: 1
  http://dbpedia.org/resource/A+BE: 1
  http://dbpedia.org/resource/AAACN_Viewpoint: 1
  http://dbpedia.org/resource/AACN_Advanced_Critical_Care: 1
  http://dbpedia.org/resource/AACN_Nursing_Scan_in_Critical_Care: 1
The file entity_to_frequent_classes-100-10000.yml lists for each sampled entity that belongs to a frequent class the list of frequent classes it belongs to. Excerpt:
--- 
http://dbpedia.org/resource/!Hero: 
  http://dbpedia.org/ontology/Musical: 1
http://dbpedia.org/resource/$24_in_24: 
  http://dbpedia.org/ontology/TelevisionShow: 1
http://dbpedia.org/resource/$50SAT: 
  http://dbpedia.org/ontology/SpaceMission: 1
http://dbpedia.org/resource/$pread: 
  http://dbpedia.org/ontology/Magazine: 1
http://dbpedia.org/resource/%22...And_Ladies_of_the_Club%22: 
The file step1-100-10000.time contains the number of seconds that passed between the start of step 1 and the end of step 1.
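
The two mappings created in step 1 can be sketched as follows. This is an illustration under assumptions, not the code of experiment.pl: it takes rdf:type triples as N-Triples lines in memory instead of reading the .bz2 file, it omits the random sampling controlled by max_entities_per_class, and the helper name index_types and the example IRIs are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: builds class => entities and entity => classes
# mappings from rdf:type triples and drops classes with fewer than
# $min_entities_per_class entities.
sub index_types {
    my ($lines, $min_entities_per_class) = @_;
    my (%class_to_entities, %entity_to_classes);
    for my $line (@$lines) {
        next unless $line =~ m{^<([^>]+)>\s+<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>\s+<([^>]+)>\s*\.\s*\z};
        my ($entity, $class) = ($1, $2);
        $class_to_entities{$class}{$entity} = 1;
    }
    for my $class (keys %class_to_entities) {
        my $size = scalar(keys %{ $class_to_entities{$class} });
        if ($size < $min_entities_per_class) {
            delete $class_to_entities{$class};   # class is not frequent
            next;
        }
        $entity_to_classes{$_}{$class} = 1
            for keys %{ $class_to_entities{$class} };
    }
    return (\%class_to_entities, \%entity_to_classes);
}

my $type = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type';
my @lines = (
    "<http://example.org/resource/E1> <$type> <http://example.org/ontology/A> .\n",
    "<http://example.org/resource/E2> <$type> <http://example.org/ontology/A> .\n",
    "<http://example.org/resource/E3> <$type> <http://example.org/ontology/B> .\n",
);

# A threshold of 2 is used here only to keep the example small.
my ($class_to_entities, $entity_to_classes) = index_types(\@lines, 2);
for my $class (sort keys %$class_to_entities) {
    print "$class:\n";
    print "  $_: 1\n" for sort keys %{ $class_to_entities->{$class} };
}
```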

experiment.pl - Step 2

TODO: describe step 2 in more detail. Step 2 reads the files 2020.12.01/anchor-texts-sorted-counted-reversed.txt.bz2, 2020.11.01/infobox-properties_lang=en.ttl.bz2, 2020.12.01/mappingbased-objects_lang=en.ttl.bz2, 2020.12.01/mappingbased-literals_lang=en.ttl.bz2, and 2020.12.01/instance-types_lang=en_specific.ttl.bz2, and populates the folders data_per_entity and data_per_class. The files step2-100-10000-10.finished and step2-100-10000-10.time are created as well.
The file /data_per_entity/mbly/dbr-108th_Delaware_General_Assembly-sub-10.ttl contains the triples with the entity dbr:108th_Delaware_General_Assembly in subject position. Excerpt:
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/name> "C. Douglass Buck"@en .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://www.w3.org/2000/01/rdf-schema#label> "108th Delaware General Assembly"@en .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/name> "Governor"@en .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/office> "108"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/predecessor> "107"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/successor> "109"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/termEnd> "1937-01-05"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/termStart> "1935-01-08"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/ontology/termPeriod> <http://dbpedia.org/resource/108th_Delaware_General_Assembly__Tenure__1> .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://xmlns.com/foaf/0.1/name> "C. Douglass Buck"@en .
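The file /data_per_entity/mbly/dbr-108th_Delaware_General_Assembly-obj-10.ttl contains the triples with the entity dbr:108th_Delaware_General_Assembly in object position. Excerpt: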
<http://dbpedia.org/resource/107th_Delaware_General_Assembly__Tenure__1> <http://dbpedia.org/ontology/successor> <http://dbpedia.org/resource/108th_Delaware_General_Assembly> .
<http://dbpedia.org/resource/109th_Delaware_General_Assembly__Tenure__1> <http://dbpedia.org/ontology/predecessor> <http://dbpedia.org/resource/108th_Delaware_General_Assembly> .
The file data_per_class/dbo-Politician/sub-100-10000.ttl.bz2 contains, for the class dbo:Politician and the sampled instances of that class, all triples with such an instance in subject position. Excerpt:
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/name> "C. Douglass Buck"@en .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/name> "Governor"@en .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/office> "108"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/predecessor> "107"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/successor> "109"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/termEnd> "1937-01-05"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://dbpedia.org/resource/108th_Delaware_General_Assembly> <http://dbpedia.org/property/termStart> "1935-01-08"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://dbpedia.org/resource/126th_Delaware_General_Assembly> <http://dbpedia.org/property/name> "Governor"@en .
<http://dbpedia.org/resource/126th_Delaware_General_Assembly> <http://dbpedia.org/property/name> "Russell W. Peterson"@en .
<http://dbpedia.org/resource/126th_Delaware_General_Assembly> <http://dbpedia.org/property/office> "126"^^<http://www.w3.org/2001/XMLSchema#integer> .
The file data_per_class/dbo-Politician/obj-100-10000.ttl.bz2 contains, for the class dbo:Politician and the sampled instances of that class, all triples with such an instance in object position. Excerpt:
<http://dbpedia.org/resource/(Much)_Wenlock_(UK_Parliament_constituency)> <http://dbpedia.org/property/candidate> <http://dbpedia.org/resource/George_Weld-Forester,_3rd_Baron_Forester> .
<http://dbpedia.org/resource/10th_Philippine_Legislature> <http://dbpedia.org/property/governorGeneral> <http://dbpedia.org/resource/Frank_Murphy> .
<http://dbpedia.org/resource/10th_arrondissement_of_Marseille> <http://dbpedia.org/property/leaderName> <http://dbpedia.org/resource/Guy_Teissier> .
<http://dbpedia.org/resource/111th_Infantry_Brigade_(Pakistan)> <http://dbpedia.org/property/notableCommanders> <http://dbpedia.org/resource/Yahya_Khan> .
<http://dbpedia.org/resource/111th_Virginia_General_Assembly> <http://dbpedia.org/property/chamber2Leader> <http://dbpedia.org/resource/Richard_L._Brewer_Jr.> .
<http://dbpedia.org/resource/113th_United_States_Congress> <http://dbpedia.org/property/caption> <http://dbpedia.org/resource/John_Cornyn> .
<http://dbpedia.org/resource/114th_United_States_Congress> <http://dbpedia.org/property/caption> <http://dbpedia.org/resource/John_Cornyn> .
<http://dbpedia.org/resource/115th_United_States_Congress> <http://dbpedia.org/property/caption> <http://dbpedia.org/resource/John_Cornyn> .
<http://dbpedia.org/resource/118th_New_York_State_Legislature> <http://dbpedia.org/property/speaker> <http://dbpedia.org/resource/Hamilton_Fish_II> .
<http://dbpedia.org/resource/119th_New_York_State_Legislature> <http://dbpedia.org/property/speaker> <http://dbpedia.org/resource/Hamilton_Fish_II> .
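
Collecting triples per entity in subject and object position can be sketched as follows. This is an illustration, not the code of experiment.pl: it works on N-Triples lines in memory instead of the .bz2 files, returns hashes instead of writing files, and uses a made-up helper name, group_triples_by_entity.

```perl
use strict;
use warnings;

# Hypothetical helper: for every sampled entity, collect the triples in
# which it occurs as subject and the triples in which it occurs as object.
sub group_triples_by_entity {
    my ($lines, $sampled) = @_;
    my (%as_subject, %as_object);
    for my $line (@$lines) {
        # subject IRI, predicate IRI, object (IRI or literal), final dot
        next unless $line =~ m{^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*\z}s;
        my ($subject, $object) = ($1, $3);
        push @{ $as_subject{$subject} }, $line if $sampled->{$subject};
        # Only IRIs can refer to an entity; literals are skipped here.
        if ($object =~ m{^<([^>]+)>\z} && $sampled->{$1}) {
            push @{ $as_object{$1} }, $line;
        }
    }
    return (\%as_subject, \%as_object);
}

my $entity = 'http://dbpedia.org/resource/108th_Delaware_General_Assembly';
my @lines = (
    qq{<$entity> <http://dbpedia.org/property/name> "C. Douglass Buck"\@en .\n},
    qq{<$entity> <http://dbpedia.org/ontology/termPeriod> <http://dbpedia.org/resource/108th_Delaware_General_Assembly__Tenure__1> .\n},
    qq{<http://dbpedia.org/resource/107th_Delaware_General_Assembly__Tenure__1> <http://dbpedia.org/ontology/successor> <$entity> .\n},
);

my ($as_subject, $as_object) = group_triples_by_entity(\@lines, { $entity => 1 });
print "subject position: ", scalar(@{ $as_subject->{$entity} }), "\n";   # prints 2
print "object position: ",  scalar(@{ $as_object->{$entity} }),  "\n";   # prints 1
```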

experiment.pl - Step 3

Step 3 processes 2020.07.01/short-abstracts_lang=en.ttl.bz2. TODO: show example result files. Names of example result files: dbr-Ekiti_State_House_of_Assembly-patterns-100-10000-4.yml.bz2 and dbr-Library_of_Grand_National_Assembly-propertypatterns-100-10000-4-10-4-5-50.yml.bz2.

experiment.pl - Step 4

Step 4 aggregates the patterns: it collects the frequent linguistic patterns for each frequent class.

experiment.pl - Step 5

Step 5 creates the rules.
Once experiment.pl has run successfully, it can be convenient to run compress_entity_files.pl to compress the files that contain the triples that have been collected per entity, so that less storage is used. However, this step can be omitted.
Once experiment.pl has run successfully, the script statistics.pl can be run via the command line as perl statistics.pl .