HILT Hypotheses

What follows is a series of e-mails detailing hypotheses, ideas and guidance on issues related to research within the scope of the HILT Project. The idea behind the writing of such hypotheses is that stakeholders and project group members will then help to identify data that may allow us to refute some hypotheses and/or lend support to others. The hypotheses are based on discussion which arose with both the HILT Project Management Group and the HILT Steering Group.

For the most recent HILT hypotheses please consult:

9. Hypotheses Version 2
From: Dennis Nicholson <d.m.nicholson@strath.ac.uk>

The e-mails below are listed in chronological order so as to retain the strand of discussion:

1. HILT: Request for hypotheses
From: Susannah Wake <susannah.wake@strath.ac.uk>

2. RE: HILT: Request for hypotheses
From: Chris Rusbridge <c.rusbridge@compserv.gla.ac.uk>

3. Re: [Fwd: RE: HILT: Request for hypotheses]
From: "Dr P.B. Watry" <P.B.Watry@liverpool.ac.uk>

4. HILT hypotheses
From: "Craven, Louise" <louise.craven@pro.gov.uk>

5. Hypothesis
From: Alan Gilchrist <cura@fastnet.co.uk>

6. Summary of HILT Hypothesis (1st Draft)
From: Dennis Nicholson <d.m.nicholson@strath.ac.uk>

7. Re: Summary of HILT Hypothesis (1st Draft)
From: Rachel Heery <lisrmh@ukoln.ac.uk>
Followed with more perspectives from UKOLN by Rosemary Russell.

8. Re: Summary of HILT Hypothesis (1st Draft)
From: Dennis Nicholson <d.m.nicholson@strath.ac.uk>

9. Hypotheses Version 2
From: Dennis Nicholson <d.m.nicholson@strath.ac.uk>

10. FW: Summary of HILT Hypothesis (1st Draft)
From: Stella G Dextre Clarke <SDClarke@LukeHouse.demon.co.uk>

Forwarded by Dennis Nicholson 25/01/01

11. Re: Hypotheses Version 2
From: "MacEwan, Andrew" <Andrew.MacEwan@BL.UK>

Subject: HILT: Request for hypotheses
Date: Wed, 20 Dec 2000 10:55:17 +0000
From: Susannah Wake <susannah.wake@strath.ac.uk>
To: sg hilt <lis-hilt-sg@jiscmail.ac.uk>
CC: mgt hilt <lis-hilt-mgt@jiscmail.ac.uk>

Dear All,

As requested I am sending you an e-mail to remind you about one of the actions we decided to carry out at the Steering Group meeting on the 14th December. The action was for each member of the group to devise a hypothesis for a possible solution to the problem of cross-searching by subject across communities. Alternatively any ideas on what wouldn't work will also be accepted. These do not need to be in-depth, just have a good brainstorm.

Included below, for reference, are a couple of hypotheses we have developed. If you like you could also tell us why these would or would not work. It would be beneficial to all if we could conduct this through the mailing list.

Hope that you all have a very festive Christmas and New Year,

Susannah.

p.s. Thanks to those who have sent us cards.

Hypothesis 1: A possible solution is that each community co-ordinates its own efforts towards creating standards and a limited set of classification schemes, subject schemes and thesauri. Then, when a search site is set up to search across these communities, there will be a more manageable number of schemes, which will make mapping easier. In essence each sector must ensure that its controlled vocabulary is centrally amended and updated.

Hypothesis 2: Use a universal faceted classification scheme such as UDC as a language-independent core or intermediary concept language. Build a universal macro-thesaurus on it (there may already be one such as the UNESCO thesaurus; alternatively use DDC and LC subject headings, which are already mapped to each other). Have each subject community use their own most established thesaurus or thesauri to create microthesauri off the macro thesaurus. This process can continue down to more detailed levels of thesauri for more specific subject areas. Different sectors with different needs can either come in at the macro level (for general museums and archives) or work in with the micro level if they cover specific subject areas.
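
As a rough illustration of the "intermediary concept language" idea in Hypothesis 2, the sketch below translates a term from one community's vocabulary to another's by way of a shared core notation. Everything in it is an assumption made for the example: the terms, the notations (deliberately written as "C629" and so on, so that they are not mistaken for real UDC or DDC numbers) and the function names are hypothetical and are not taken from any real scheme or mapping.

```python
# Hypothetical illustration: every term and notation below is invented for
# this sketch and is not taken from UDC, DDC, UNESCO, LCSH or any real mapping.

# Each community maps its own terms to a notation in the shared core scheme.
LIBRARY_TO_CORE = {"Motor vehicles": "C629", "Railways": "C625"}
MUSEUM_TO_CORE = {"Automobiles": "C629", "Locomotives": "C625"}


def invert(term_to_core):
    """Build a core-notation -> list-of-terms index for one community."""
    index = {}
    for term, notation in term_to_core.items():
        index.setdefault(notation, []).append(term)
    return index


def translate(term, source, target):
    """Translate a term from one community's scheme to another via the core."""
    notation = source.get(term)
    if notation is None:
        return []  # the term is not known in the source scheme
    return invert(target).get(notation, [])


print(translate("Motor vehicles", LIBRARY_TO_CORE, MUSEUM_TO_CORE))
```

The attraction of a core scheme in this sketch is that each community maintains one mapping to the core. With N schemes that means N mappings, rather than a separate mapping for each of the N(N-1)/2 pairs of schemes.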

Subject: RE: HILT: Request for hypotheses
Date: Wed, 20 Dec 2000 16:44:01 -0000
From: Chris Rusbridge <c.rusbridge@compserv.gla.ac.uk>
To: 'Susannah Wake' <susannah.wake@strath.ac.uk>
CC: "'LIS-HILT-SG@JISCMAIL.AC.UK'" <lis-hilt-sg@jiscmail.ac.uk>

I have some rather more pessimistic Rusbridge Hypotheses. Don't read too much into the pessimism; the optimistic hypotheses have already been put forward...

R1: the general mapping of thesauri is impossible in any accurate way, since it represents a particular case of the machine translation problem, which notoriously fouls up on ambiguous terminology (remember the joke about 'the vodka is good but the meat is rotten').

R1a: Machine translation is a fruitful field to explore in HILT!

R2: even if thesaurus mapping is possible, it will be of little help to the average searcher, given what is known about searching habits. In particular, it is difficult to persuade searchers to map their specific term to a more general one, or to make use of 'advanced' search facilities.

R2a: use of thesauri to broaden terms will only increase the deluge of useless information emerging from queries, and it will be even harder to find the answer amongst the dross.

R3: even if searching would be helped, those who prepare the metadata cannot be relied upon either to adhere to the thesauri nominally in use, or to bring their data up to date (search any largish library catalogue by author to see the number of name variants you find).

R4: browse structures represent some form of classification (often informal and irregular); browse structures fail to the extent that there is a mismatch between the classification scheme and the browser's mental models. Being able to re-build browse structures according to a different classification scheme that more accurately represents the mental model would help browsers navigate more easily.

R5: there exists a high level thesaurus, but there is imperfect mapping between it and domain specific thesauri. This mismatch will be detrimental rather than helpful to the searcher in many cases (although the mapping may be helpful in others).

R6: this is a case where the flexibility of the human brain is paramount. A good librarian will be of more assistance than an unfamiliar thesaurus!

R7: re-phrasing the issue in terms of ontologies will help (I could have put this in the negative as well).

R8: the answers can only be found through user testing with searchers, browsers and metadata managers.

--
Chris Rusbridge
Director of Information Services,
University of Glasgow
GLASGOW G12 8QQ
phone 0141 330 2516 fax 0141 330 5620
email: C.Rusbridge@compserv.gla.ac.uk

Subject: Re: [Fwd: RE: HILT: Request for hypotheses]
Date: Fri, 22 Dec 2000 09:43:11 +0000 (GMT)
From: "Dr P.B. Watry" <P.B.Watry@liverpool.ac.uk>
To: Susannah Wake <susannah.wake@strath.ac.uk>
CC: LIS-HILT-SG@JISCMAIL.AC.UK, c.rusbridge@compserv.gla.ac.uk

Hello

I think that Chris is largely on the money with these comments, particularly about use of high level thesauri. There appear to be three major difficulties.

1. The very pressing need to train data contributors how to use thesauri correctly in order to construct controlled access points. Many archivists involved with the HE Archives Hub, for example, made up unofficial extensions to the UNESCO thesaurus without any form of versioning control. Also, many didn't appear to understand that it is best to maintain thesauri independently of the services they support.

2. The related question of getting end-users to use thesauri effectively to get to the information they need. Here, there needs to be a variety of search strategies which are transparent, including 2-stage hypertext browsing, support for relevance ranked searches, and capability to browse subject access headings. All of which are now implemented for the HE Archives Hub (which supports both LCSH and UNESCO data). The most recent research suggests that "clustering" related subject headings may be the most effective way forward.

3. The final question of "mapping" one thesaurus onto another. We are carrying on experiments in this direction. But I have to say that LCSH for all its faults seems to be the most appropriate thesaurus for English language data sets. To me it seems the best solution for many data contributors is to create "official" extensions to LCSH which suit their needs and register them at the Library of Congress, where they can be properly versioned, distributed, and supported.

For the HE Archives Hub we constructed a Cheshire Z39.50 resource for LCSH and UNESCO. See http://gondolin.hist.liv.ac.uk/~cheshire/lcsh and http://gondolin.hist.liv.ac.uk/~cheshire/unesco. You may find it interesting to have a look at the initial search page of the HE Archives Hub with the new subject browsing capabilities. http://www.archiveshub.ac.uk

I will be away for my annual leave for the first three weeks of January.

Best wishes

Paul

Subject: HILT hypotheses
Date: Wed, 3 Jan 2001 15:53:21 -0000
From: "Craven, Louise" <louise.craven@pro.gov.uk>
To: "'susannah.wake@STRATH.AC.UK'" <susannah.wake@strath.ac.uk>

Happy New Year!

I'm starting from a position which accepts that Hypothesis 1 and Hypothesis 2 can be achieved, then identifies a problem which exists in the use of thesauri I'm familiar with, and then applies the problem to both hypotheses.

Apologies for being so detailed and micro rather than macro!

Problem: how will a HILT cope with words whose meanings have changed over time?

Where there is a clear change in meaning, and a clearish date, an artificial cut off date could be used as a qualifier (as some cataloguers of medieval docs do with the use of toponyms, epithets and patronymics, to mark the development of surnames), or the context could be given in parentheses after the term (as with homonyms.)

but

where there is no clean change and where the older meaning may be retained in a current compound, this is not so easy to do

eg intelligence

currently according to OED defined as 'the faculty of understanding' and found in UNESCO as a next narrower term under the whole microthesaurus

Psychology

but in 15th-18th century usage, intelligence means 'news or information'

This usage is, however, currently maintained in 'military intelligence' (OED: 'The obtaining of information; the agency for obtaining secret information 1697, (Revived in modern wars)'), as also in 'marketing intelligence'

How are these sequential and variant meanings to be provided for?

Within the UNESCO thesaurus, military intelligence can be provided for by adding the term and by adding see alsos with upward posting. (The differences in meaning here though are not related in the strict associative sense, or in the ISO 2788 sense found as 8.4.3 Terms belonging to different categories)

4.10 Psychology
intelligence
see also information sciences
see also military intelligence

5.05 Information sciences
Information
see also psychology
see also military intelligence

6.45 Civil, Military and mining engineering
Military Engineering
NT 1 Military intelligence
see also information sciences
see also psychology

LCSH has

Intelligence
Intelligence Quotients

and

Intelligence Service
Military intelligence

similar to UNESCO's 4.10 and 6.45, but again a term/relationship which accommodates the historical meaning of intelligence as news/information needs to be added.
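
To make the searcher's difficulty concrete, here is a minimal sketch that holds the three proposed UNESCO entries above as data and looks a word up across them. The entry contents follow the arrangement quoted above; the data layout and the lookup function are assumptions made for this illustration and do not describe how any thesaurus software works.

```python
# Entries follow the proposed UNESCO arrangement quoted above; the data
# layout and the lookup function are assumptions made for this sketch.
ENTRIES = [
    {"section": "4.10 Psychology",
     "term": "Intelligence",
     "see_also": ["Information sciences", "Military intelligence"]},
    {"section": "5.05 Information sciences",
     "term": "Information",
     "see_also": ["Psychology", "Military intelligence"]},
    {"section": "6.45 Civil, military and mining engineering",
     "term": "Military intelligence",
     "see_also": ["Information sciences", "Psychology"]},
]


def lookup(word):
    """Return the sections whose term or see-also references mention the word."""
    word = word.lower()
    hits = []
    for entry in ENTRIES:
        names = [entry["term"]] + entry["see_also"]
        if any(word in name.lower() for name in names):
            hits.append(entry["section"])
    return hits


for section in lookup("intelligence"):
    print(section)
```

In this sketch a search on the single word 'intelligence' reaches all three sections, which is the web of relationships the searcher would have to untangle.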

If you apply the problem now to Hypothesis 1 and 2:

Hypothesis (1) Clearly in this instance these terms could be mapped, but the searcher looking for 'intelligence' (we know not of what meaning) would be faced with a set of relatively complicated relationships re one term in two thesauri. If this is magnified in terms of similar examples and numerous thesauri, are we making a structure which is unwieldy and unhelpful to the user? These are after all relatively high level terms.

Would a browse facility at the outset of the HILT reduce relationship links and possible ambiguity which searches would throw up?

Hypothesis (2) Can a language independent core or an intermediary concept language deal with these kinds of changes in meaning?

Best wishes

Louise Craven

Louise would like to add that in using these examples she was not referring to mapping as a whole but to instances "i.e. the changing of meaning." She believes that mapping is possible but points out that "there are many problems which need addressing/ for which we need to find solutions."

Subject: Hypothesis
Date: Wed, 10 Jan 2001 18:21:02 -0000
From: Alan Gilchrist <cura@fastnet.co.uk>
To: LIS-HILT-MGT@JISCMAIL.AC.UK

A hypothesis for HILT

Preamble

HILT is, essentially, an exercise dealing with probability and granularity. Unless huge funds are made available, the approach must be pragmatic, and possibly iterative in the implementation, moving towards an acceptable level of performance rather than striving for perfection.

There are three levels of granularity:
1. An outline description of the stakeholder's collection.
2. The classification/indexing language used by the stakeholder.
3. The set of classification/indexing decisions made by the stakeholder.

One, above, might provide a useful first filter. With respect to 2. and 3. above, there are probably only four options available for achieving correlation between the stakeholder collections:
1. A complete merger/reconciliation of the language schemes (hugely impracticable)
2. A partial reconciliation of the language schemes (recommended below)
3. The creation and use of a switching mechanism or "intermediate lexicon" through which indexing decisions would be translated (hugely impracticable, and unlikely to be effective given the heterogeneous nature of the stakeholder collections)
4. The use of "intelligent" software to crawl the indexing decisions made by each stakeholder (pre-supposing that they are all in electronic format). Such an agent would probably be in the family of automatic categorisers/text miners and be rule-based (problem of uncertainty of efficacy and probably a prohibitive cost, unless a friendly software house could be persuaded to finance an experiment).

Hypothesis

1. Establish subject-based profiles of each collection, possibly in the form of a matrix - generic subjects vs. form. (The recent questionnaire might provide some or all of this data).
2. Select a common, "neutral" classification scheme (DDC, Broad System of Ordering (BSO), UNESCO Thesaurus) as a generic benchmark.
3. Ask each stakeholder to indicate on this benchmark those subjects which are contained in his/her collection to some reasonably significant extent; and to some previously agreed hierarchical levels (which will vary between the major subjects).
4. Consider the possibility of extending this mark-up to include numbers of items in each collection against each of the marked-up subjects (derivable from electronic records as posting frequency?).
5. Support local searching of selected collections.
6. Establish a feedback mechanism to examine failures. (For example, all or selected failures could be passed to each stakeholder to be searched independently in each collection).
7. Adjust and maintain the benchmark scheme and the mark-ups accordingly.
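
The steps above can be sketched as a small lookup over collection profiles. This is only an illustration under stated assumptions: the collection names, the subjects and the posting counts are invented, and the ranking rule (largest posting count first) is one possible choice rather than anything proposed in the hypothesis.

```python
# Hypothetical collections and posting counts, invented for illustration.
# Each profile records, per benchmark subject, how many items are posted to it.
PROFILES = {
    "Collection A": {"Economics": 1200, "Sociology": 300},
    "Collection B": {"Economics": 150, "History": 900},
    "Collection C": {"Sociology": 800},
}


def rank_collections(subject, profiles, minimum=1):
    """List collections holding the subject, largest posting count first.

    `minimum` is an assumed threshold standing in for "some reasonably
    significant extent"; collections below it are left out.
    """
    hits = [
        (name, counts[subject])
        for name, counts in profiles.items()
        if counts.get(subject, 0) >= minimum
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


print(rank_collections("Economics", PROFILES))
print(rank_collections("Economics", PROFILES, minimum=200))
```

The searcher would then carry out the detailed search locally in each selected collection, using that collection's own indexing language.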

Summary

This hypothesis, or possible line of attack, suggests that it could be practicable to work at the level of the indexing languages used (and how they are used), rather than at the far more detailed and complex level of indexing decisions made by each stakeholder. In effect, the HILT would be a weighted index to the indexing languages used; and the searcher, having selected the most likely collections, would then be obliged to conduct specific searches of each collection selected, using the particular indexing languages and protocols provided. [Question: Does this approach seem to be too modest in comparison with the initial aspirations of the HILT Project?]

However, there may still be reconciliation problems at the indexing language level, caused by local pragmatics, or the interpretation (even of quasi-standard schemes) in order to meet perceived local needs. (The daftest example I can think of occurred at Aslib some years ago. Because Aslib was the FID National Member, it used the UDC in its Library. The concept "Cottage industries" was classified as "Industry:Cottages"). There may well exist more legitimate examples. It is not inconceivable to imagine a report on "Money-lending in rural India" being classed under India, Economics, Sociology by three different collections with different core interests.

Subject: Summary of HILT Hypothesis (1st Draft)
Date: Mon, 15 Jan 2001 14:05:37 -0000
From: Dennis Nicholson <d.m.nicholson@strath.ac.uk>
To: LIS-HILT-MGT@JISCMAIL.AC.UK

Dear All

Thank you for your helpful responses to our request for hypotheses. I have now compiled what I hope is a coherent and more or less logically arranged list of these reworded in what I trust :) is a helpful way.

I would be grateful if you would study the attached document with a view to telling me:

1. Whether I have missed something out that you said (I felt that a number of the hypotheses sent were essentially the same as each other but that, obviously, is a matter of interpretation)

2. Whether, having read this wondrous document, you are inspired to add other hypotheses to the list

3. Whether you feel I have misrepresented something you submitted

4. Whether you feel I should reword anything I have written

PLEASE DO NOT, AT THIS STAGE, SEND DISCUSSIONS OR REFUTATIONS TO THE LIST. I WANT TO GET THE LIST RIGHT BEFORE WE START THAT

Cheers

Dennis

Attached Document:

HILT Hypotheses Summary (1st draft)

Pre-question: What is the problem, what is its scope?

What is required to solve the problem is:

1. A magic wand - in other words, the problem can't be solved, either because no solution will be universally acceptable or sufficiently flexible or any solution will be too difficult to implement for political or practical or economic reasons (e.g. because machine translation is required and this is impossible because of the ambiguity of many terms) or staff in different services and/or communities won't apply the agreed standard approach correctly or in the same standard fashion or the users won't use or understand the wonderful new scheme even if it is put in place and used consistently. Arguably, this hypothesis is simply the negation of all of the others and translates to 'All positive hypotheses as to a solution have been shown to be refutable therefore the problem can't be solved'. There is, however, at least one good reason to leave it in. An extreme variation of it is arguably that - despite appearances - there is no problem. I guess this translates as the users are fine as they are - we think they need the universal and consistent use of controlled terminologies for cross searching but in reality they don't, they either don't need to do it or are happy to muddle through with what they've got (which may well be more suited to their preferred approach). I don't believe this, but it is theoretically a position that might be defended in the absence of evidence to the contrary. This is arguably the hypothesis 'the problem can't be solved because there is no problem'. If it is true, or even largely true, it will, at the very least, have implications for how much trouble, effort and expense it would be worth going to solve the problem via one of the hypothetical solutions proposed below.
Another variation is that many users - maybe even the majority - will find that the use of thesauri will make finding things more difficult because, since users can't map their mental terminologies to the thesauri, it will increase the number of false drops (I'm inclined to award Chris the prize for the best 'standing the accepted view on its head and giving it a good shake' hypothesis for this one :) ). If it is true then it may be that, for example, 'the greatest good' is served by not having a HILT, even if some users do need it [Question: Does having a choice solve the problem? Note that the answer is not necessarily 'yes']

2. A single universal scheme that everyone, albeit reluctantly in some cases, will sign up to and apply (hot favourites here being LCSH and Unesco - not necessarily in that order, I hasten to add). Included here is the creation of "official" extensions to (for example) LCSH to meet 'local' needs so that they can be properly versioned, distributed, and supported.

3. A mapping between two or more key schemes (LCSH and Unesco?) - especially if there is an automated terminologies mapping service allowing a gradual build up and maintenance of a complex series of mappings [Question: Does Louise's example refute this by showing that successful mapping is impossible?]

4. A community-based approach in which each community aims to ensure adherence to standards and to a limited set of classification schemes, subject schemes and thesauri. This is required to ensure interoperability within the community and to make inter-community interoperability at least a more manageable problem by limiting the extent of variations within communities [Question 1: are communities self-defining - i.e. a community is a community if it says it is and institutions are members if they say they are?] [Note: this is a re-casting of the hypothesis previously known as 1]

5. The combined use of one or more universal subject schemes mapped not to each other but to a classification scheme such as DDC or UDC which is then used as a 'language independent core' and the basis for cross-searching between services using different universal schemes. This would probably be implemented in the context of hypothesis 4 [Note: this is a re-casting of part of the hypothesis previously known as 2 - it is also, roughly speaking, what OCLC propose (I think), although their view would be that DDC is the best choice for various reasons - especially since it supports communication across linguistic boundaries. It is also, roughly speaking, the approach suggested by Alan and others and an approach that has been considered within CAIRNS and which we may be able to test here at Strathclyde - see my earlier e-mail]

6. As for 5, except create a new universal scheme based on DDC or UDC. In this scenario, it would be possible for communities to also create more specific micro-thesauri for more in-depth searching of their collections. In this event, different sectors with different needs can either come in at the macro level (for general museums and archives) or work in with the micro level if they cover specific subject areas. [Note: this is a re-casting of the other part of the hypothesis previously known as 2]

7. Machine translation [Note: we need to define precisely what we mean by this - for example, do we include neural networks and AI agents that can learn or be taught about mappings?]

8. Not only one or other of 2-7 above and 9-14 below but also some mechanism for 'mapping' the HILT terminologies to those in the minds of the users is needed. [Question: Might it be enough to allow users to browse the terms used and do their own mapping?]

9. Not only one or other of 2-8 above and 10-14 below but also some means of ensuring that staff apply whatever solution is adopted consistently and fully understand what it is they are aiming to do - training, hierarchical checking mechanisms, metadata creation aids etc. - is needed.

10. Not only one or other of 2-9 above and 11-14 below but also a mechanism to allow the user to re-build the classification/browsing structure automatically to suit their own mental terminology is needed. [Question: How would this be done?]

11. Not only one or other of 2-10 above and 12-14 below but also a coherent mapping match between the solution and existing domain specific thesauri is needed [Note that this is not the same as 6 above].

12. Not only one or other or a mix of 2-11 above and 14 below, but also a good librarian or other intermediary to translate between user and thesauri is needed.

13. Not only one or other of 2-12 above but also good user training and a suitable variety of flexible search facilities are needed (e.g. 2-stage hypertext browsing, relevance ranked searches, browsable subject headings, clustering related subject headings, linking combinations of key words, combining clusters).

14. Not only 2-13 above but also a multi-lingual capability is needed

15. A good librarian or intermediary without all these other trappings.

16. A closer analysis of the nature of the problem - for example, would re-phrasing the issue in terms of ontologies help illuminate the route to a solution? [There is an attempt to define what an ontology is in this context at http://www-ksl.stanford.edu/kst/what-is-an-ontology.html If that doesn't help, then try http://www.cs.vassar.edu/faculty/welty/papers/subjects/subject.html and look at Figure 2.1 and its legend. I'm no expert, though, so if someone out there can point to a better web page please feel free]

17. Empirical data obtained by conducting tests involving searchers, browsers and metadata managers.

DMN HILT 15.01.01

Subject: Re: Summary of HILT Hypothesis (1st Draft)
Date: Tue, 16 Jan 2001 11:24:38 +0000
From: Rachel Heery <lisrmh@ukoln.ac.uk>
To: LIS-HILT-SG@JISCMAIL.AC.UK

On Mon, 15 Jan 2001, Dennis Nicholson wrote:

> 2. Whether, having read this wondrous document, you are inspired to add other hypotheses to the list

I hesitate to add anything as there is already a long list... but perhaps there could be more mention of existing commercial products as a possible solution? Perhaps the following could be used as expansions of hypotheses 3 and/or 7?

---------------------------------------------------------------
Hypothesis: there are existing commercial products which would offer cost effective means to improve navigation of 'large digital spaces' by the end-user. There are a variety of products (e.g. WordMap [1], Autonomy[2]) which take different approaches. They are characterised by using taxonomies, whether built from a sample document dataset or from combining existing taxonomies, in order to enhance the user's natural language search.

A complementary hypothesis might be that it is prohibitively expensive to commit to a solution that involves hand-crafted metadata creation (especially if one considers trying to update existing metadata) and that the solution must be available as 'middleware' separate from the metadata repositories themselves. (This might be considered making sense from a funding viewpoint in that one could subscribe to the 'navigation enhancement' as a separate facility independent of particular services.)

----------------------------------------------------------------
The above hypotheses assume the taxonomy assists the user in gathering search terms which are then used to search a number of diverse services. I think the success of this approach in the commercial world might depend on searching against full text, but maybe it would work on (rich) structured metadata too....

I do think it would be helpful to develop 'statements of the problem we are trying to solve' in parallel with the hypotheses... are we trying to enable subject access at the 'item level' or at the 'collection level'? are we trying to accommodate the user's detailed search terms or are we making available collection strengths?

btw I noticed mention of Alan Gilchrist's report on the TFPL web site. A fitting extract from exec summary:

" At the heart of the taxonomy debate is the need to achieve a balance between the talent of the taxonomy designer, the cost of the system to implement the taxonomy and the familiarity of the users both with the system and the structure of the information itself."

http://www.tfpl.com/areas_of_expertise/__knowledge_management/taxonomies/taxonomies.html

Title: Taxonomies for business: Access and connectivity in a wired world
Authors: Alan Gilchrist, Peter Kibby, Barry Mahon and Sandra Ward
Publisher: TFPL Ltd., 17-18 Britton Street, London, EC1M 5TL, UK
Date: November 2000
ISBN: 1-870889-98-3
Price: £80 / $120 plus p&p

1. www.wordmap.com
2. www.autonomy.com

also see

http://www.interwoven.com/products/metafinder/description.html

Rachel

Rosemary Russell <lisrr@ukoln.ac.uk> from UKOLN noted important issues for HILT to address. These include:

* As has been said, HILT has ambitious aims, and a fairly short timescale

* It is difficult to see that a single common scheme might be found to meet the requirements of such a broad group of communities

* Will people be motivated enough to be prepared to envisage compromising and/or to undertake additional work, e.g. mapping from a local scheme to a universal one? May take some persuasion.

* Collection owners/service providers who use specialist, detailed subject schemes may perceive a 'dumbing down' of their local 'advanced' search options when accessed by a distributed cross-searching service which uses a high-level thesaurus. (This objection has often arisen in discussions about Z39.50 searching, where service providers can be reluctant to offer 'lowest common denominator' searching, as opposed to their local specialised search interface, which may offer many more options. Research (Dig Lib?) has shown that research users are keen to have *both* - cross-searching and individual specialist database searching.)

* There is potentially a very large number of schemes for HILT project staff to analyse -- see already the A-Z list of thesauri at: http://hilt.cdlr.strath.ac.uk/Sources/thesauri.htm

* Related Renardus research issues to track: One of the things Renardus wants to do is add subject browse access functionality at the Renardus broker level. The subject scheme chosen for this is the top levels of the Dewey Decimal Classification (DDC). Use of DDC has been negotiated with OCLC Forest Press. However, most of the gateways that the Renardus broker will interrogate do not use DDC. A small group of Renardus people are meeting next week to talk about how to produce recommendations for classification mappings from the schemes used by gateways to DDC. Things to be considered include: a. how detailed the top-level view of DDC should be (this may differ between parts of the schedules); b. whether to discard the 'facet-type' features of DDC; c. how to do the mappings themselves - one-to-one mappings will not necessarily be useful. It isn't clear whether creating a useful (and scalable) browse system in this way is possible, but it is worth investigating. (MD)
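
Point (a) in the Renardus item above, on how detailed the top-level view of DDC should be, can be illustrated with a short sketch that reduces a DDC number to its main class, division or section. It relies only on the standard three-digit structure of DDC notation (hundreds, tens, units). The function names and the per-class depth table are assumptions made for this example and are not Renardus recommendations.

```python
def ddc_top_level(notation, digits=1):
    """Reduce a DDC number to its first `digits` digits, padded with zeros.

    digits=1 gives the main class (e.g. 000), digits=2 the division
    (e.g. 020) and digits=3 the section (e.g. 025).
    """
    if digits not in (1, 2, 3):
        raise ValueError("digits must be 1, 2 or 3")
    whole = notation.split(".")[0].strip()
    if len(whole) != 3 or not whole.isdigit():
        raise ValueError(f"not a DDC number: {notation!r}")
    return whole[:digits] + "0" * (3 - digits)


# Hypothetical choice: show more detail in the 000s than elsewhere, since the
# level of detail may differ between parts of the schedules.
DEPTH_BY_MAIN_CLASS = {"0": 2}


def browse_class(notation):
    """Pick the browse heading for a DDC number using the per-class depth."""
    main = notation.strip()[:1]
    return ddc_top_level(notation, DEPTH_BY_MAIN_CLASS.get(main, 1))


print(browse_class("025.43"))
print(browse_class("512.5"))
```

Mapping a gateway's own scheme onto these headings is the harder part (point c) and is not attempted here.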

Rosemary

Subject: Re: Summary of HILT Hypothesis (1st Draft)
Date: Tue, 16 Jan 2001 14:05:36 -0000
From: Dennis Nicholson <d.m.nicholson@strath.ac.uk>
To: LIS-HILT-SG@JISCMAIL.AC.UK

Thanks, Rachel. I will incorporate into next draft.

Would you agree that it might fit into an amended and extended version of 7 (Machine Translation)?

Dennis

Subject: Hypotheses Version 2
Date: Mon, 22 Jan 2001 13:29:10 -0000
From: Dennis Nicholson <d.m.nicholson@strath.ac.uk>
To: LIS-HILT-MGT@JISCMAIL.AC.UK

Dear All

I attach a second version of the HILT hypotheses document for comment. I've made changes to hypotheses 3 and 7 in line with comments from the HILT e-mail groups and expanded the introduction covering what problem we are addressing and also reworked the wording of the hypotheses in line with this.

I'd appreciate it if you could comment on the new version. Specifically:
1. Do you agree with the wording/detail in the introduction?
2. Are we still missing some hypotheses?
3. Do the hypotheses as reworked look OK - do they still mean what you intended or understood them to mean earlier?
4. Anything else that occurs to you.

Please don't comment on the truth or falsehood of any of the hypotheses as yet. Once I get responses on this version, we will look at how the results of the stakeholder and literature surveys impinge on the hypotheses and also invite comment from the three lists on whether participants have access to data that will refute a particular hypothesis.

Thanks - please try to respond by the end of the week

Cheers

Dennis

Attached Document:

HILT Hypotheses Summary
[2nd draft - changes made to 3 and 7 in line with comments on V1 plus introduction expanded and hypotheses reworked in line with this]

Introduction: What is HILT aiming to do? HILT is charged with examining the practices and circumstances of UK networked information and resource services in respect of the subject description of their materials with the aim of determining:

1. What action, if any, is required with regard to harmonising or integrating differing practices in this area in order to optimise accurate and appropriate collection, item, and sub-item level retrieval by subject in cross-searches of these services to whatever extent the staff of the services believe is required by the users they serve - in short, to determine whether there really is a problem, or a problem of any significant magnitude, and, if so, what its scope is. Implicit in this is a question of whether or not the situation is the same for all services and their users or whether the answer to these questions differs in respect of different sub-groups of the total group of services.

2. Whether any problem identified can be solved in an affordable, cost-effective, practical, politically workable way that is both sustainable and compatible with international approaches to such issues and, if so, how it may best be solved.

Hypotheses

In line with this, the initial list of HILT hypotheses is:

0. There is no problem, or no significant problem. I guess this translates as: the users are fine as they are. We think they need the universal and consistent use of controlled terminologies for cross-searching but in reality they don't; they either don't need to do it or can happily muddle through with what they've got (which may well be more suited to their preferred approach). I don't believe this, but it is theoretically a position that might be defended in the absence of evidence to the contrary. If this hypothesis is true, or even largely true, it will, at the very least, have implications for how much trouble, effort and expense it would be worth going to in order to solve the problem via one of the hypothetical solutions proposed below.

1. There is a problem, and it can be solved in an affordable, cost-effective, practical, politically workable way that is both sustainable and compatible with international approaches to such issues by implementing one or other, or a mix, of the ideas detailed in the list below (see 2 onwards). This hypothesis will be refuted if all of the remaining hypotheses are refuted (including 0 above). Refutation will mean that there is a problem but no way of solving it within overall requirements - either because no solution will be universally acceptable or sufficiently flexible, or because any solution will be too difficult to implement for political or practical or economic reasons (e.g. because machine translation is required and this is impossible because of the ambiguity of many terms), or because staff in different services and/or communities won't apply the agreed standard approach correctly or in the same standard fashion, or because the users won't use or understand the wonderful new scheme even if it is put in place and used consistently. Another variation suggested is that many users - maybe even the majority - will find that the use of thesauri makes finding things more difficult because, since users can't map their mental terminologies to the thesauri, it will increase the number of false drops. The argument is, I guess, that 'the greatest good' is then served by not having a HILT, even if some users do need it. [Question: Does having a choice solve the problem? Note that the answer is not necessarily 'yes'.]

2. The problem can be solved by using a single universal scheme that everyone, albeit reluctantly in some cases, will sign up to and apply (hot favourites here being LCSH and Unesco - not necessarily in that order, I hasten to add). Included here is the creation of "official" extensions to (for example) LCSH to meet 'local' needs so that they can be properly versioned, distributed, and supported.

3. The problem can be solved by using a mapping between two or more key schemes (LCSH and Unesco?) - especially if there is an automated terminologies mapping service allowing a gradual build-up and maintenance of a complex series of mappings.

4. The problem can be solved by using a community based approach in which each community aims to ensure adherence to standards and to a limited set of classification schemes, subject schemes, and thesauri. This is required to ensure interoperability within the community and to make inter-community interoperability at least a more manageable problem by limiting the extent of variations within communities. [Question: are communities self-defining - i.e. a community is a community if it says it is and institutions are members if they say they are?]

5. The problem can be solved by using one or more universal subject schemes mapped not to each other but to a classification scheme such as DDC or UDC which is then used as a 'language independent core' and the basis for cross-searching between services using different universal schemes. This would probably be implemented in the context of hypothesis 4 [Note: this is, roughly speaking, what OCLC propose (I think), although their view would be that DDC is the best choice for various reasons - especially since it supports communication across linguistic boundaries. It is also, roughly speaking, the approach suggested by Alan and others and an approach that has been considered within CAIRNS and which we may be able to test here at Strathclyde - see my earlier e-mail]

6. The problem can be solved by using a similar approach to 5 but one in which a new universal scheme based on DDC or UDC is created so that communities can also create more specific micro-thesauri, based on the scheme, for more in-depth searching of their collections. In this event, different sectors with different needs can either come in at the macro level (for general museums and archives) or work in with the micro level if they cover specific subject areas.

7. The problem can be solved by using a machine-assisted solution, perhaps involving middleware like Wordmap (www.wordmap.com) or Autonomy (www.autonomy.com) and based on taxonomies (see http://www.interwoven.com/products/metafinder/description.html and http://www.tfpl.com/areas_of_expertise/__knowledge_management/taxonomies/taxonomies.html) or (possibly) neural networks or AI agents that can learn or be taught about mappings. One reason for this being the case might be that a solution based on metadata crafting may be prohibitively expensive.

8. The problem can be solved by using a mix of one or other of the various solutions proposed elsewhere in this list plus some mechanism for 'mapping' the HILT terminologies to those in the minds of the users. [Question: Might it be enough to allow users to browse the terms used and do their own mapping?]

9. The problem can be solved by using a mix of one or other of the various solutions proposed elsewhere in this list plus some means of ensuring that staff apply whatever solution is adopted consistently and fully understand what it is they are aiming to do (training, hierarchical checking mechanisms, metadata creation aids etc.).

10. The problem can be solved by using a mix of one or other of the various solutions proposed elsewhere in this list plus a mechanism to allow the user to re-build the classification/browsing structure automatically to suit their own mental terminology. [Question: How would this be done?]

11. The problem can be solved by using a mix of one or other of the various solutions proposed elsewhere in this list plus a coherent mapping match between the solution and existing domain specific thesauri [Note that this is not the same as 6 above].

12. The problem can be solved by using a mix of one or other of the various solutions proposed elsewhere in this list plus a good librarian or other intermediary to translate between user and thesauri.

13. The problem can be solved by using a mix of one or other of the various solutions proposed elsewhere in this list plus good user training and a suitable variety of flexible search facilities [e.g. 2-stage hypertext browsing, relevance ranked searches, browsable subject headings, clustering related subject headings, linking combinations of key words, combining clusters].

14. The problem can be solved by using a mix of one or other of the various solutions proposed elsewhere in this list plus a multi-lingual capability.

15. The problem can be solved by using a good librarian or intermediary without all these other trappings.

16. A solution can (only) be found through a closer analysis of the nature of the problem. For example, would re-phrasing the issue in terms of ontologies help illuminate the route to a solution? [There is an attempt to define what an ontology is in this context at http://www-ksl.stanford.edu/kst/what-is-an-ontology.html (if that doesn't help, then try http://www.cs.vassar.edu/faculty/welty/papers/subjects/subject.html and look at Figure 2.1 and its legend). I'm no expert, though, so if someone out there can point to a better web page please feel free.]

17. A solution can (only) be found by obtaining empirical data from tests involving searchers, browsers and metadata managers.
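
Illustration of hypothesis 5: the 'language independent core' idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of how OCLC, CAIRNS or HILT would implement it. The two scheme tables, their terms, the class numbers attached to them, and the function names are all hypothetical placeholders chosen for illustration; they are not real LCSH, Unesco or DDC mappings.

```python
# Hypothetical sketch of hypothesis 5: each subject scheme maps its own terms
# to a shared classification number (the "core"), and a term is translated
# between schemes by going through that number rather than by direct mapping.
# All terms and class numbers below are illustrative placeholders.

SCHEME_A_TO_CORE = {"Cookery": "641", "Motor cars": "629"}
SCHEME_B_TO_CORE = {"Cooking": "641", "Gastronomy": "641", "Automobiles": "629"}


def invert(term_to_core):
    """Turn a term -> core-number table into core-number -> sorted terms."""
    core_to_terms = {}
    for term, core in term_to_core.items():
        core_to_terms.setdefault(core, []).append(term)
    return {core: sorted(terms) for core, terms in core_to_terms.items()}


def translate(term, source_to_core, target_to_core):
    """Return the target-scheme terms that share the source term's core number."""
    core = source_to_core.get(term)
    if core is None:
        return []  # the source scheme has no mapping for this term
    return invert(target_to_core).get(core, [])


print(translate("Cookery", SCHEME_A_TO_CORE, SCHEME_B_TO_CORE))
```

Even in this toy case one source term comes back as two target terms, which is one reason mappings via a core scheme are unlikely to be simply one-to-one.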

DMN HILT 22.01.01


Subject: FW: Summary of HILT Hypothesis (1st Draft)
Date: 16 January 2001 15:20
From: Stella Clarke [mailto:sdclarke@lukehouse.demon.co.uk]
To: d.m.nicholson@strath.ac.uk

Dennis,

This comment comes not from a "stakeholder" but from an interested observer. Thus I have not seen the correspondence on the HILT management list, just your own summary. And I am not sure what is the overall scope of the project.

One option not on your list is to develop a "search thesaurus" rather than a controlled vocabulary. In other words, it would seek to give you inspiration as to keywords to try, rather than guiding you to preferred usage. There are several models of search thesaurus and no consensus. In one model, the origin of each keyword is noted (e.g. from INSPEC thesaurus, and/or ERIC, etc) so that you know which database(s) to apply it to. Even better if it had the numbers of postings! Then it would be a resource to help you choose databases as well as improve your search statement. Some imaginative things could be done with concept clustering too. The search thesaurus could be viewed as a variation on your first hypothesis, in which you give up trying to cajole users into understanding what an IR thesaurus is, and fall back on a sort of souped-up inspirational tool in the Roget mould. Sorry if all this is outside the scope of the project. I'm not pushing it as the best solution either - just another option.

Stella


Subject: Re: Hypotheses Version 2
Date: 6 February 2001
From: "MacEwan, Andrew" [Andrew.MacEwan@BL.UK]

Dennis et al

A comment on the aims. I suggest introducing the word "navigation" in addition to "retrieval" by subject. Retrieval only covers the concept of inputting a subject term and matching it with items/collections indexed with that term. Navigation covers the aspect of placing a given subject in a context within a thesaurus or classification scheme - thus allowing users to use the structure to inform and redefine their searches. I think it is worth including this concept in the aims because it helps to inform judgement on what the different hypotheses can achieve for the end user.

Aim 1 could read:

"Whether it would benefit the end user to harmonise or integrate differing practices in this area in order to optimise accurate and appropriate navigation and retrieval by subject at collection, item and sub-item level."

Regards

Andrew MacEwan
The British Library

© HILT: High-Level Thesaurus