Name: Multilanguage metadata
Uploaded: 2024-03-06T00:00:00.0000000
Duration: T00H30M31S
Description: Multilanguage metadata

Thumbnail URL: https://cadmoremediastorage.blob.core.windows.net/6461d8b0-6fd0-4a7e-b374-761206a7929b/videoscrubberimages/Scrubber_1.jpg

Embed URL: https://stream.cadmore.media/player/6461d8b0-6fd0-4a7e-b374-761206a7929b

Content URL: https://cadmoreoriginalmedia.blob.core.windows.net/6461d8b0-6fd0-4a7e-b374-761206a7929b/Multilanguage metadata.mp4?sv=2019-02-02&sr=c&sig=0CeggPNWaatywKypV5HgsglxgH%2BL3otgqSrfVrX4K2M%3D&st=2024-12-26T11%3A47%3A34Z&se=2024-12-26T13%3A52%3A34Z&sp=r

Transcript: Language: EN.
Segment:0 .
It's great. Hi, everybody. Sorry it took me so long to get over here. No worries. Thank you so much, Craig. What a great discussion. Thanks. It was difficult to get into Zoom somehow for me, but glad to see everybody here.
Thank you so much to everybody, to our panelists for your presentations. Welcome, everyone, to the live discussion on multilingual metadata. Feel free to introduce yourselves in the chat, enter your questions and comments for the panelists there, or raise your hand and we can call on you to ask your questions out loud. I also want to quickly thank the sponsors of NISO Plus 2023 and acknowledge Yasushi Akasaka and Rococo Nakajima, who created and organized the session.
We have all our panelists here and I can reintroduce you, but I think everybody knows who they are. If not, we can certainly reintroduce ourselves. We had some questions in the chat in the recorded session and we'll be monitoring them here. They started out being about translation, and perhaps our panelists would like to react to that.
What do you see as the role of translation, and of improving metadata for translations into multiple languages, for scholarly publications? Oh, and if you don't want to tackle translation, we also have a question in the chat: what language attribute do you use for metadata if a title is multilingual, and how does it affect a screen reader?
Are we using ISO language codes? Proprietary codes? Juan Pablo, do you want to address that for the Public Knowledge Project's system? Yeah, I can talk a little bit about the first question, and about the second one to some extent, without being able to speak too much to the technical details for the second question.
But on the first one, let me speak a little bit around that. On the role of translation: there are both automated translations and human translations, and both, I think, have a role to play. And I think they can be helpful, to some extent, in terms of at least being able to have the content available in multiple languages. There is some of the analysis that we've been doing as part of that project I described.
We're seeing that English is still what gets used when there are multiple languages: over half of the cases are still going to have an English translation of the work. But there certainly aren't enough interfaces that accommodate the input of translations. And I think that all of the work that we are doing, everything actually described by all of the other panelists, is very much restricted by and tied to the technologies in the back end that allow us to do this work.
That applies to being able to input the metadata correctly and in standardized ways. And there I think automated tools could help, for example, to create additional translations so that they can be present in the metadata, with humans only having to check them. But then the additional challenge is that metadata, as it moves across to other platforms, also needs to be displayed in those multiple languages.
And I think maybe that gets a little bit to that second question, because you can have those translations carry forward. But then if the receiving systems are not capable of displaying them, they often end up choosing or prioritizing one language over another. Within what we do with the Public Knowledge Project, we will put in all of the translated languages, as you saw from the interface I showed there, where we allow entering data in any number of languages that the system will accept.
Basically, it's any language for which we've had anyone at least provide a very basic translation, and we just make it available online using, I think, several different metadata standards on the web page itself. In the HTML, using Dublin Core, we input the languages in multiple titles, and we do the same with Google Scholar's proprietary meta tags so that Google Scholar picks them up as well. And we do use ISO standards for the country or the language codes.
Those then get translated in the different outputs that we provide. So when we export metadata, we try to use what is there: there are alternate title fields as well as some of the different language attributes for the titles, abstracts and other fields. So what we end up doing, as it gets exported into other systems, is we try to use whatever the fields are for that particular standard.
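As a rough illustration of the approach described here (one title per language, each tagged with an ISO language code in the page's HTML), the following minimal Python sketch builds such meta tags. The function name `title_meta_tags`, the tag names `DC.Title` and `DC.Title.Alternative`, and the use of an `xml:lang` attribute are assumptions made for the example; the discussion does not confirm the exact tags the Public Knowledge Project software emits.

```python
from html import escape


def title_meta_tags(titles, primary):
    """Build <meta> tags for a title that exists in several languages.

    titles:  dict mapping an ISO 639-1 code to the title in that language
    primary: code of the language the article itself is published in
    (Hypothetical helper; tag names below are assumed, not confirmed.)
    """
    tags = []
    # Primary title first, then translations in a stable (sorted) order.
    ordered = [primary] + sorted(code for code in titles if code != primary)
    for code in ordered:
        # Assumed convention: the primary title uses DC.Title,
        # translated titles use DC.Title.Alternative.
        name = "DC.Title" if code == primary else "DC.Title.Alternative"
        tags.append(
            f'<meta name="{name}" xml:lang="{escape(code)}" '
            f'content="{escape(titles[code])}">'
        )
    return tags


if __name__ == "__main__":
    for tag in title_meta_tags(
        {"en": "Open access", "es": "Acceso abierto"}, primary="es"
    ):
        print(tag)
```

Keeping the language code on every tag, rather than only on the page as a whole, is what lets a receiving system choose which title to display instead of silently keeping one and dropping the rest.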
So Crossref being the first one that comes to mind, and that's the one that we've been looking at most. We typically end up putting those titles in there using the alternate title for translations as well as the additional language attribute on the title field. Right, and I notice that Sophie Roy, who asked that question, has a follow-up, too.
What about titles that have more than one language? And I guess we have to be certain. If the metadata isn't telling us what language the publication is in, how risky is it to guess by the title? I know at the International Bibliography, titles don't necessarily tell us all that much about the content of the article, and often not even the language. We do have that issue sometimes with Hebrew, where we'll have titles that are in English, but the publication itself is in Hebrew.
Yeah, we've seen that quite commonly in our analysis, where they only enter one language, and often the system that they're inputting it into might only allow that one language, regardless of what language the content is actually in. Some publications or some authors end up choosing to put the English-language title in there because I think it's a strategy for visibility.
And so this is why I was cautioning around calling this a metadata quality error. Is it an error? Is it a problem? It's actually an intended use of the metadata. It's just maybe not the use intended by the majority, but it is the intended use of the person entering it, who uses it as a strategy for discovery.
But the mixed-language example gets tricky: are you describing the language of the publication or the content, or trying to describe all the languages that are in a particular field, in an abstract or in a title? And yeah, this is where the multilingualism gets tricky, and it gets messy to try to accommodate it correctly. One comment. Yeah, usually, in our case, a data provider specifies
which language they want to describe their data in. But the actual characters are a different matter, because with Japanese, for example, Japanese titles can include the English alphabet, but that doesn't matter. Technically that is just a character coding issue. So the point is: OK, this metadata is intended
for a specific language; that is exactly the point. And coding is another problem. Of course, coding is sometimes problematic. Unicode covers so many things now, but before that we had a lot of problems, and Japanese especially is very special. So that was problematic. Now most of the problems are solved. So a mixture of characters from different languages is, of course, a very different case. I understand that now. That is my point.
Yeah. Yeah. Thank you. Jen, did you have a comment? Yeah. I want to talk about my experience with multilingual metadata and full text, as I operate KoreaScience in Korea. KoreaScience provides a bilingual service in Korean and English.
So researchers from Western societies, after reading the metadata up to the abstracts, sometimes cannot read the content because it was written in Korean. So sometimes they ask me to provide the full text in HTML, or they want me to translate it into English.
I can't convert it manually whenever they request it, so we are developing an automatic way to convert it into English HTML format. So these days many Western researchers can read Korean articles with the help of Google translation. And any issues that you've encountered?
I'm sure there are many, with quality and translation. Most of the articles are in science and technology, so the terminology is very concise. So there are no issues. That's good. Yeah, but we had a question from Vincent. Let's see, Vincent, if you'd like to unmute and ask your question, or I can read it.
Sure. Thank you. If you can hear me: there was a mention that there have been challenges, as many of us know, with getting correct multilingual metadata into Crossref. It was mentioned that the problem has actually been solved in Crossref.
So for those of us who are still struggling with multilingual metadata and Crossref, do you have any advice for us? Vincent, this is where we get into some of the nitty-gritty of the technical details on exactly which fields and so on are recommended by Crossref. This is where one of my colleagues, Mike Nason, comes in.
He's the person that reads all of the scope notes for every field of Crossref, and he knows those inside and out. So he'd be the best person for that. But I can link in the notes, in a moment, to a report that we put out around some of the recommendations, looking at how we've been handling this. I'll need a few minutes to be able to find it, and I can certainly point you to the resources that we've been using as we've been analyzing these fields.
So I'll try to see if I can find them while we're still on the call today. And otherwise I'll give you my contact info. I'm happy to follow up with you on that, because I know we have some of that information available, but I just don't have it right at my fingertips. And yeah, my point, as I said in my presentation, is that there is some primary language of the content that's in the metadata.
So the original content is, for example, in Japanese. We may also attach English metadata, and that's good for things like cross-searching, for example. But it's not the original metadata; that's something secondary, secondary metadata, so to speak. And so I also want to distinguish: OK, this is primary metadata, this is secondary metadata.
So that is also helpful when their content is indexed and searched. One of the issues that came up was quality assurance for multilingual metadata, or cleanup.
In many cases, I was pleased to see; I think Juan Pablo's presentation had a graph that shows a decrease over time in the number of errors that have been spotted in publishers' incoming multilingual metadata. Let me ask each of the panelists to weigh in on what, if any, are some of the easily implemented things that we could do to improve the quality of the multilingual metadata.
What kind of tools are you using now, or what kind of improvements would you like to see? Are you using language guessers or spell check? We at the International Bibliography have built-in spell check in English, but, you know, our backend interface, very briefly, will support spell checkers for a number of different languages.
I mean, again, just to bring up our biggest issue: with Hebrew metadata, what comes in to us, what comes in to me, is very often backwards, and very often I just have to correct it using a simple text reversal tool. So that's on our end, with Hebrew in particular, and I know this also happens with Arabic.
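A simple text reversal tool of the kind mentioned here could look something like the following minimal Python sketch. It is an assumption-laden illustration, not the tool actually used: the helper names are hypothetical, and the detection heuristic (Hebrew final-form letters turning up at the start of words rather than the end) is one plausible way to guess that a string was stored in visual order instead of logical order.

```python
# Hebrew letters that normally occur only at the END of a word:
# final kaf, final mem, final nun, final pe, final tsadi.
FINAL_FORMS = set("\u05da\u05dd\u05df\u05e3\u05e5")


def looks_reversed(text):
    """Guess whether Hebrew text was stored backwards (visual order).

    Heuristic (an assumption for this sketch): if final-form letters
    appear at the start of words more often than at the end, the
    string is probably reversed.
    """
    words = text.split()
    starts = sum(1 for w in words if w[0] in FINAL_FORMS)
    ends = sum(1 for w in words if w[-1] in FINAL_FORMS)
    return starts > ends


def fix_visual_order(text):
    """Reverse the whole string only when it looks reversed.

    Caveat: embedded Latin words or numbers would be reversed too,
    so a person should still review the result.
    """
    return text[::-1] if looks_reversed(text) else text


if __name__ == "__main__":
    logical = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"
    backwards = logical[::-1]
    print(fix_visual_order(backwards) == logical)
```

Text with no final-form letters at all gives no signal either way, so the sketch leaves it untouched rather than guessing.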
That specifically is the biggest issue, because transliteration is not in that metadata category, at least the way that we operate. Right. Yeah, as an aggregator of metadata, we usually don't modify anything, because we just collect metadata.
So we don't need any tools. But of course there is a chance to check things like the availability of metadata items: some item may be missing, and an abstract in English, for example, is often missing. So some checks are possible, but currently we are not so serious about this issue; we are just collecting and indexing them at the moment.
I know also that you had a figure where there was a fairly large number of publications where the language was unknown, and you said that was presumed to be Japanese. Is there any interest in putting up a barrier or a reminder or some sort of filter where you might be able to reduce the number of citations where that's the case?
It's mainly because of data transferred from other systems. Our own system adopts such language options, so our system covers that option, but systems outside do not have such functionality. So then we regard it as: OK, this is unknown. Of course, it is technically solvable; it is easy to guess the language code. So it is a point where we can improve our system. One thing we are starting to do now with our software, as part of a project that's just getting launched in Europe specifically focused on trying to improve metadata even further,
including multilingual handling within the PKP application, is having a few more built-in checks to look, for example, at the input, making sure that the language fields are entered. So there are automatic checks that you can do to see whether the language of a field that's entered appears to match, because you can automatically try to detect the language, and at least alert and have the person confirm if it looks like there is a mismatch.
So we'll be building in some additional checks, because we are one of the platforms that is a generator of metadata, right? Those are the starting point, right from the authors themselves. And I think that there's room to have some standardized checks that are put into place in more systems. That could be a sort of tool set, a set of practices that get adopted by other manuscript management systems or repository software, around certain checks: you don't have to stop the publishing if someone intends to do something, but at least alert when it appears that certain errors are present.
And then the other thing that I would say is not so much a tool, but something that we can all do: I think there needs to be more outrage and demand for multilingual interfaces to be present in different software for collecting metadata, because far too much of our scholarly communication system presumes monolingualism. And it almost certainly presumes that that language is English. And so then you have everyone working in another language trying to shoehorn in what they do.
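The kind of check described here, which alerts on an apparent mismatch but never blocks, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it detects the writing script rather than the language itself (so it cannot tell English from French), the `EXPECTED_SCRIPTS` table and the function names are hypothetical, and it is not the check being built into the PKP software.

```python
import unicodedata

# Hypothetical, deliberately small mapping from a declared language
# code to the scripts we would expect to dominate the text.
EXPECTED_SCRIPTS = {
    "en": {"LATIN"}, "fr": {"LATIN"}, "es": {"LATIN"},
    "ja": {"CJK", "HIRAGANA", "KATAKANA"},
    "ko": {"HANGUL"},
    "he": {"HEBREW"},
    "ar": {"ARABIC"},
}


def dominant_script(text):
    """Return the most frequent script among the letters of text.

    The script is taken from the first word of each character's
    Unicode name (for example 'LATIN', 'CJK', 'HANGUL').
    """
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
        counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else None


def language_warning(declared, text):
    """Return a warning string on an apparent mismatch, else None.

    The caller is expected to show the warning and let the person
    confirm; nothing here prevents the record from being saved.
    """
    expected = EXPECTED_SCRIPTS.get(declared)
    found = dominant_script(text)
    if expected is None or found is None or found in expected:
        return None
    return (f"Field is tagged '{declared}' but the text looks like "
            f"{found} script; please confirm.")


if __name__ == "__main__":
    print(language_warning("ja", "Multilingual metadata in publishing"))
```

Returning a message instead of raising an error keeps the decision with the person entering the metadata, which matters for cases raised in this discussion such as an English title deliberately entered for a work in another language.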
And that, I think, is a cultural change, not so much a technological one, but there isn't the demand from the majority to say that this is something that they expect in all of the systems that they interact with. OK. In this session I realized the importance of improving quality, especially on the language issue, even in our data. Because we are mainly working for domestic users,
it hasn't mattered for us, so that was fine. But our data, of course, is now also delivered to other systems. So a correct language tag is probably more important. Yeah, that's what we learned in this session. So we can discuss internally how to improve our system.
Yeah. I mean, I even see issues here of equity and access. You know, as Juan Pablo points out, if interfaces for scholarly publishing and manuscript management are geared towards English or a single language, I mean, there may be systems out there that we're not even aware of that are geared toward a language other than English, that work great for that but aren't interoperable with other languages.
That's a missed opportunity, and it's a barrier that needs to be overcome. Yeah. Steve, I don't have any other questions here in the chat. Sophie, yes, please go ahead. Hi, I love all the examples that you're sharing. I thought I'd share one from us.
I'm in Canada, and we do publish in English and French, and occasionally a single document includes both English and French, multilingual, as the Korean example showed. And we're using MODS to mark up the metadata, and it only allows for one language tag.
And if we put, well, I don't even know if there is such a thing as "bilingual", but it wouldn't mean anything to anybody. And we don't want to put English because that's not really what it is. And we cannot put two. So we choose to put none. And so in the first talk, when that mistake was found, that no language was indicated: we're actually doing that on purpose.
Interesting. Yeah. So, yes, systems still need to evolve to allow for these things. Thank you very much. Very interesting. Thank you. So that's a great example of what I was saying: I think anybody looking at it without understanding the cultural context out of which it emerges, and the intentional choice of the people in it, would consider it a metadata error.
Right, it's like, oh, here is incompleteness in the metadata; the metadata is incorrect or inaccurate. But in reality it is an intent; it's trying to reflect something, and it's trying to reflect: I don't want to declare that English or French is the dominant language of this document. And you're shoehorning this. This is the challenge with a lot of these examples that were there.
It's lovely to hear all of the diversity of what people are wanting to express, and then the constraints that both the metadata and the systems impose, and that's our challenge, I think, right now. And I'm kind of grateful that this kind of conversation is happening and giving space to talking about the importance of multilingualism in scholarly publishing, because otherwise we are not going to have the push for making sure that the systems are able to incorporate and accommodate this diversity while still being able to function and allow us to make use of the metadata in the ways that we want.
But I don't think that the two things are irreconcilable. But yeah, I really appreciate those examples. Yeah, yeah. Thank you, Sophie. So bilingual text is a very good example, and a good challenge also. Yeah, because we assume content is in a single language, but in the Canadian case it's both French and English,
equally important in that case. And yeah, in my opinion, as I said, we need a primary language. But in the Canadian case, the two languages must probably be treated equally. That's another case of multilingual metadata. So sometimes we need a primary language.
Sometimes we need, like, an equal ranking of languages. So probably multilingual metadata should handle such priority among languages by default. Yeah. Thanks to Sophie. Yeah, right. When I'm indexing websites for the bibliography, I run into that sometimes, where there will be two languages, with the same content in both languages, where they're equally prioritized.
Sometimes it will be using one language to describe a text that's in a full electronic edition in another language. So yeah, I have encountered instances of that. How do you describe that? Yeah. I mean, if you have a website that has lots of Latin texts from the Middle Ages, do you want to list Latin as a language of the website itself?
Right. In fact, I have had to index a handful of Latin text editions on an English website. And again, how do you describe that? Is that useful to someone searching? I should acknowledge, too, that Vincent has pointed out in the chat that ISO does have the language code mul that can be used for multiple languages, but you have to indicate which languages are involved in a different field.
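The ISO 639-2 code list includes `mul` for multiple languages and `und` for an undetermined language. A minimal Python sketch of the convention just described, with a single language slot plus a separate field listing the languages actually present, might look like this. The function name and the `languages_present` field name are hypothetical and chosen only for the example.

```python
def language_fields(languages):
    """Choose a single language code plus a list of languages present.

    languages: iterable of ISO 639-2 codes found in the document.
    Returns a dict with the one-slot 'language' value and a
    hypothetical 'languages_present' field naming each language.
    """
    codes = sorted(set(languages))
    if not codes:
        # No language information at all: ISO 639-2 'und' (undetermined).
        return {"language": "und", "languages_present": []}
    if len(codes) == 1:
        return {"language": codes[0], "languages_present": codes}
    # More than one language: 'mul', with the detail kept separately.
    return {"language": "mul", "languages_present": codes}


if __name__ == "__main__":
    print(language_fields(["fra", "eng"]))
```

For the bilingual Canadian documents mentioned earlier, this would record `mul` in the single language slot without declaring either English or French the dominant language.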
But yeah, there is that sort of escape in any case. So we're getting close to the end of our allotted time. But I think we've identified several items, and I'm not keeping track of them, but I hope Mary Beth is, that are certainly opportunities for NISO to move forward and to put together a team to examine them and maybe come up with recommendations.
I think of the words that Todd read earlier, at the beginning of the conference, about NISO's mission, about providing unfettered access to information: certainly one of those factors to overcome is language. It can bring us together, but it can also put us into silos. And we want to get past that as much as we can. Final thoughts, comments from any of our panelists, any of the attendees?
OK? Yeah. Well, I got some time to talk here, so. Yeah, interesting talks here. Of course, each region has a different problem, like in our case, the Korean case, or a hybrid case like the Canadian case.
So what we have are probably still just examples here. So we should explore and include more defined language issues to get some good solutions. That's my comment. Thank you. I'm happy to hear. Thank you. That's it. Likewise. I appreciate it very much.
Hearing from all of your different cases from around the world, having spent some time digging into some of these different things: we see the metadata and then try to make some assumptions, or try to understand, OK, why would they have done this? But it's really helpful and interesting to be hearing these things from around the world. And I'll just finish also with an invitation for anyone that wants to engage with PKP around these issues in particular.
You know, some of you are indexing content that may be generated from our software. We're always looking for ways of trying to make our software more inclusive while adhering to standards, but in a way that is able to accommodate more of the different cultural contexts in which our software is used. So really an open invitation to reach out to me, my email is there in the chat, or to come to our website and find the different ways of engaging, in the forum and elsewhere, with suggestions and ideas as to what other things we can be doing: checks, fields, whatever it is that we can do that would better accommodate the diversity that's out there.
So that's an invitation not just to our panelists, but to everyone in the audience as well, to reach out. Well, thank you. Thank you very much, everybody. It was a terrific session, and I appreciate your attending at all different hours around the world, in different time zones. Thanks very much.
Take care. Thank you. Here, everybody.
