Judaica DH at the Penn Libraries Blog //Reviewing Sorting Phase Data: Hebrew or Arabic Script?

Scribes of the Cairo Geniza — Help researchers prepare ancient documents for transcription!

To celebrate our volunteers’ hard work & review the data produced in the Sorting Phase, we’re sharing a series of blog posts that answer some questions about this project. This part reviews the question of whether a subject was in Hebrew or Arabic script. Part 2 reviews the question of whether a subject was written in formal or informal script. Part 3 looks at visual characteristics on the fragments. Part 4 reviews classification tags from the Talk boards.

Why are these different scripts important?

You’ll notice in the project that we were careful to refer to Hebrew and Arabic as scripts rather than languages. That’s because a script can be used to write multiple languages. As our volunteers pointed out on the project Talk boards through the use of hashtags, the Cairo Geniza contains fragments with Ladino, Judeo-Arabic, Judeo-Persian, and Judeo-Aramaic text, among others. Many of the libraries that contributed to this project make this distinction in their metadata, including separate notes for script and language.

In the transcription phase, we hope volunteers with little or no expertise in these scripts can use the interface to transcribe fragments, as well as those who are fluent in Hebrew or Arabic. Sorting by script allows someone to transcribe a fragment written in Hebrew or Arabic script regardless of the language.

How many subjects were sorted into each script?

In the sorting phase, volunteers sorted 40,109 subjects from the Cairo Geniza. (All percentages given in this series are out of the total number of subjects in the project unless otherwise specified.) There were six script options for volunteers when sorting a subject:

A screenshot of the sorting workflow interface. In what script is this text written? Options include: “Hebrew”, “Arabic”, “Both”, “This image looks too hard”, “There is no text on either side of this fragment”, and “The text on the fragment is illegible.”

When volunteers first started sorting in August 2017, they only had the first four options from which to sort. The day after launch, we added the additional 2 options in order to support the variety of fragments present in the project.

Most subjects were sorted by at least 5 volunteers before they were classified and formally retired from the sorting phase. At the start of the project, we had that number as 7 volunteers, but it was adjusted in March 2018 to reflect volunteer engagement. We’ll discuss why this is important later.

Pie chart of data at first glance

24,144 subjects (60%) were classified as Hebrew script, which means every volunteer who saw the subject sorted it as Hebrew.

1,065 subjects (2.6%) were classified as Arabic script, which means every volunteer who saw the subject sorted it as Arabic. This may seem like a small section of the Cairo Geniza (which includes over 300,000 fragments), but it’s extremely valuable for researchers to be able to look at this data and know where to focus their time and resources. For example, Dr. Marina Rustow, one of the content specialists on our team, focuses on the Arabic script material in the Geniza. In her research, Dr. Rustow noticed that there is a large quantity of Arabic script material that no one has looked at or identified within library collections. Through volunteer participation in this project, we can now provide a list of these Arabic script fragments, with specific counts from collections around the world.

We want to clarify that this doesn’t mean that the subjects are conclusively written in Arabic or Hebrew script — it just means that based on the set of instructions given, volunteers identified the script as such.

For the remaining subjects, there was no definitive consensus on what script the fragment was written in, though we made some guesses based on how volunteers sorted them. We’ll talk about these subjects (and what will happen to them) in the next section, but this is the breakdown in numbers:

7,690 subjects (19%) may be sorted as Hebrew script, meaning 0 volunteers sorted the subject as Arabic, but at least 1 volunteer sorted the subject as Hebrew. Volunteers may also have sorted the subject in other ways.

788 subjects (1.9%) may be sorted as Arabic script, meaning 0 volunteers sorted the subject as Hebrew, but at least 1 volunteer sorted the subject as Arabic. Volunteers may also have sorted the subject in other ways.

4,094 subjects (10.2%) were contested, meaning volunteers disagreed on whether the fragment was Hebrew or Arabic. At least 1 volunteer sorted it as Hebrew, and at least 1 volunteer sorted it as Arabic.

2,328 subjects (5.8%) fell into another category, meaning 0 volunteers sorted the subject as Hebrew or Arabic, but sorted it in other ways.

Of those 2,328 subjects, 1,810 (4.5%) fell out of the scope of the project. This means 0 volunteers sorted the subject into Hebrew, Arabic, or Both. Instead, they thought the subject had no text on either side, the text was illegible, or that it was too hard. These subjects will be automatically retired from the project, so you won’t see them again.

These ‘out of scope’ responses are extremely important for streamlining the transcription process. Volunteers and researchers will be able to focus on transcribing without having to struggle with fragments that are extremely difficult to read, blank fragments, or microfragments.

Subject 21708151: MS-MOSSERI-I-00081-D, Genizah Research Unit, Cambridge University Library
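The grouping rules in this section can be sketched in a few lines of Python. This is only an illustration of the rules as described above, not the code used for the project: the function name and the short option labels ("Too hard", "No text", "Illegible") are placeholders, and the real classification export may look quite different.

```python
from collections import Counter

def first_glance_category(votes):
    """Group one subject by how its volunteers sorted it.

    `votes` holds one option label per classification. The labels are
    hypothetical shorthand for the six options in the sorting workflow.
    """
    if not votes:
        raise ValueError("a subject needs at least one classification")

    counts = Counter(votes)          # missing labels count as 0
    hebrew, arabic = counts["Hebrew"], counts["Arabic"]

    if hebrew == len(votes):
        return "Hebrew"              # every volunteer chose Hebrew
    if arabic == len(votes):
        return "Arabic"              # every volunteer chose Arabic
    if hebrew > 0 and arabic > 0:
        return "contested"           # at least one vote for each script
    if hebrew > 0:
        return "maybe Hebrew"        # some Hebrew votes, no Arabic votes
    if arabic > 0:
        return "maybe Arabic"        # some Arabic votes, no Hebrew votes
    if counts["Both"] == 0:
        return "out of scope"        # no Hebrew, Arabic, or Both votes
    return "other"                   # no Hebrew or Arabic votes, but some Both


print(first_glance_category(["Hebrew", "Both", "Hebrew"]))
print(first_glance_category(["No text", "Illegible", "Too hard"]))
```

In this sketch, "out of scope" is reported separately even though, in the counts above, those subjects are part of the larger "another category" group.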

Did volunteers agree when sorting?

General Consensus

For 26,022 subjects (64.8%), volunteers agreed 100% of the time, meaning the subject was sorted into the same category by every volunteer who viewed it. That’s impressive, considering the range of expertise from our volunteer base. For volunteers who had no experience at all, this means your best guess contributed to the community of knowledge and was, more likely than not, in agreement with others.

In addition to the counts for Hebrew and Arabic scripts we mentioned in the first section, we found volunteers agreed 100% of the time in the following categories:

Subject 12510757 was one of 308 fragments (<1%) sorted as “There is no text on either side of this fragment” by every volunteer who viewed it, and because of this, it has been retired from the project. Volunteers often marked these fragments and others with blank sides with the tags #blank and #blank_side on the Talk boards.

Subject 12510757: ENA 3794, Library of the Jewish Theological Seminary

Subject 11538071 was one of 86 fragments (<1%) sorted as “The text on this fragment is illegible” by every volunteer who viewed it, and, as a result, has been retired from the project. Many of the fragments sorted as illegible were microfragments with small pieces of text.

Subject 11538071: ENA NS 79 1036.2, Library of the Jewish Theological Seminary

Subject 11608767 was one of 418 fragments (<1%) sorted as “Both” by every volunteer who viewed it. Our content specialists suggested that a fragment with both scripts likely contains more Hebrew script than Arabic script. For that reason, we decided that if volunteers sorted a fragment as having both scripts, the subject would be placed in the Hebrew transcription workflows.

Subject 11583713: Halper 129, University of Pennsylvania, Herbert D. Katz Center for Advanced Judaic Studies Library, Cairo Genizah Collection

There was only one fragment (Subject 21707750) sorted as “too difficult” by every volunteer who viewed it, likely because of its extreme length. (For the record, we believe it is in Hebrew script!) Because of the extreme difficulty, this fragment has been retired from the project.

Subject 21707750: MS-ADD-03335, Genizah Research Unit, Cambridge University Library

What did disagreement look like?

For the remaining 14,087 subjects (35%), volunteers sorted the subject into two or more different options. We created this chart in order to better see how disagreement worked across the results of the sorting phase. We calculated a value for each subject ID by counting the number of different options the subject was sorted into (2–6) and dividing that count by the total number of classifications. The higher the value, the more uncertainty about the subject’s script; the lower the value, the greater the consensus. (Because of the number of subjects involved in this project, we’ve used a bar chart to display counts.)
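As a rough illustration of that calculation, here is a minimal Python sketch. The function name and the option labels are placeholders chosen for this example, and it assumes each subject’s classifications are available as a simple list of labels.

```python
def disagreement_value(votes):
    """Number of different options chosen, divided by total classifications."""
    if not votes:
        raise ValueError("a subject needs at least one classification")
    return len(set(votes)) / len(votes)


# Five classifications spread across three different options: 3 / 5
print(disagreement_value(["Hebrew", "Hebrew", "Arabic", "Both", "Hebrew"]))  # 0.6

# Five classifications, each one different: 5 / 5
print(disagreement_value(["Hebrew", "Arabic", "Both", "Illegible", "Too hard"]))  # 1.0
```

One thing to keep in mind with a value like this is that it depends on how many times a subject was classified, so two subjects with the same number of different responses can still get different values.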

Following that, we made a pie chart displaying the number of different responses for subjects.

Screenshot of interface with six options for script

For 10,481 subjects (26.1%), volunteers chose 2 out of the 6 different responses.

For 3,051 subjects (7.6%), volunteers chose 3 out of the 6 responses.

For 522 subjects (1.3%), volunteers chose 4 of the 6 different responses.

For 33 subjects (<1%), volunteers chose 5 of the 6 different responses. In these cases, it meant that each time a volunteer viewed the subject, the subject was sorted differently. For example, subject 11596391 was flagged for discussion on the Talk boards. A closer look at the fragment below shows that it is a challenging read, so it makes sense that volunteers would struggle to identify the script.

Subject 11596391: ENA 1867, Library of the Jewish Theological Seminary

Zooniverse works on consensus — most subjects were classified by at least five different volunteers. Based on the classification results, the subject is automatically sent to the appropriate transcription workflow.

9,847 of these subjects (24.5%) were sorted into Hebrew most often. In the transcription phase, all of these subjects will be available in the Hebrew transcription workflows.

802 of these subjects (2%) were sorted into Arabic most often. In the transcription phase, all of these subjects will be available in the Arabic transcription workflows.

779 of these subjects (1.9%) were sorted into Both most often. As explained earlier, these will be available in the Hebrew transcription workflows.

227 of these subjects (<1%) were tied. In the transcription phase, these subjects will be automatically retired, so you won’t see them again as part of this project.

Chart of General Consensus responses
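The routing described above amounts to a "most votes wins" rule with a special case for ties. Here is a minimal Python sketch under that reading. It is not the actual Zooniverse retirement logic; the function name, option labels, and return strings are all made up for the example, and the post does not say what happens when an out-of-scope option receives the most votes, so the sketch reports that case as "unspecified".

```python
from collections import Counter

def route_by_most_votes(votes):
    """Pick a transcription track from the option chosen most often."""
    if not votes:
        raise ValueError("a subject needs at least one classification")

    ranked = Counter(votes).most_common()   # [(label, count), ...], highest first
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "retired (tied)"             # no single option came out on top

    top_label = ranked[0][0]
    if top_label in ("Hebrew", "Both"):
        return "Hebrew workflows"           # "Both" is placed with Hebrew
    if top_label == "Arabic":
        return "Arabic workflows"
    return "unspecified"                    # case not covered in this post


print(route_by_most_votes(["Both", "Both", "Arabic"]))
print(route_by_most_votes(["Hebrew", "Arabic"]))
```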

Script Consensus

We were curious if, based on the instructions given, volunteers would be able to identify the differences between the two scripts and, if so, whether they would agree. In the second scatter plot, we mapped consensus between the two scripts. As mentioned above, 4,094 subjects (10.2%) were contested, meaning at least one volunteer sorted it as Hebrew, and at least one volunteer sorted it as Arabic. A volunteer may also have chosen another category.

Of those contested subjects, volunteers found 2,780 subjects (6.9%) challenging, meaning not only were these subjects contested, but volunteers only ever sorted the fragment as Hebrew or Arabic. Volunteers never sorted the fragment into another category.

We calculated consensus by finding the average between the two scripts: we assigned a value of 0 to Arabic and a value of 1 to Hebrew, added up the values of the script classifications for each subject, and divided the sum by the total number of script classifications for that subject. A higher value indicated consensus for the script as Hebrew; a lower value indicated consensus for the script as Arabic.

For example, subject 11603600 (ENA 2896, JTS Library) was classified as Arabic 2 times and Hebrew 3 times, for a total of 5 classifications.

Subject 11603600: ENA 3717, Library of the Jewish Theological Seminary

(2*0) = 0
(3*1) = 3
(0 + 3)/5 = .6

A score of .6 means that volunteers leaned towards classifying the fragment as Hebrew script. Because the score is closer to the center (.5) than to a value of 1, it also shows high disagreement over the fragment’s script.
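For readers who prefer code, the same score can be written as a small Python function. This is just a sketch of the arithmetic described above; the function name is invented for the example, and it assumes only Hebrew and Arabic classifications are counted.

```python
def script_consensus(hebrew_votes, arabic_votes):
    """Average of script votes, with Arabic scored as 0 and Hebrew as 1."""
    total = hebrew_votes + arabic_votes
    if total == 0:
        raise ValueError("no Hebrew or Arabic classifications to average")
    return (hebrew_votes * 1 + arabic_votes * 0) / total


# 3 Hebrew and 2 Arabic classifications: (3*1 + 2*0) / 5
print(script_consensus(hebrew_votes=3, arabic_votes=2))  # 0.6
```

A score of 1 would mean every script classification was Hebrew, 0 would mean every one was Arabic, and .5 would be an even split.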

As you can see below, we found that in cases where subjects were contested, consensus leaned towards Hebrew. This makes sense, as we told volunteers in the tutorial that subjects would likely be in Hebrew.

Of the contested subjects, 2,607 subjects (6.4%) were classified as more likely Hebrew, meaning more volunteers identified the script as Hebrew than Arabic. In the transcription phase, all of these subjects will be available in the Hebrew transcription workflows.

169 subjects (<1%) were classified as more likely Arabic, meaning more volunteers identified the script as Arabic than Hebrew. In the transcription phase, all of these subjects will be available in the Arabic transcription workflows.

As mentioned earlier, most subjects were classified by 5 volunteers before they were retired & formally sorted. At the start of the project, we had that number as 7 volunteers. When we changed the number from 7 to 5, any subject that had already received 5 classifications had to receive one more classification before it would be retired & formally sorted. 8 contested subjects (<1%) were affected by this change, leaving the fragment still contested after retirement. In the future, we’ll share a blog post taking a closer look at some of these fragments with context from our researchers. 5 of these subjects (<1%) are available in the Hebrew transcription workflows, and 3 of these subjects (<1%) are available in the Arabic transcription workflows.

What does this mean for the Transcription Phase?

Pie chart of data moving into the Transcription Phase

This means, out of 40,109 fragments:

35,189 subjects (87.7%) have been sorted into the Hebrew transcription workflows.

1,857 subjects (4.6%) have been sorted into the Arabic transcription workflows.

3,063 subjects (7.6%) have been sorted as out of scope, and have been retired from the project entirely.
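As a quick check on the figures above, the three counts can be added up and converted to percentages of the 40,109 subjects. The short Python sketch below uses only the numbers given in this post; the variable names are just for illustration.

```python
TOTAL_SUBJECTS = 40_109

# Final counts reported above for the transcription phase.
final_counts = {
    "Hebrew workflows": 35_189,
    "Arabic workflows": 1_857,
    "out of scope": 3_063,
}

# The three groups should account for every subject exactly once.
assert sum(final_counts.values()) == TOTAL_SUBJECTS

# Share of the whole project, rounded to one decimal place.
shares = {
    name: round(100 * count / TOTAL_SUBJECTS, 1)
    for name, count in final_counts.items()
}

for name, share in shares.items():
    print(f"{name}: {share}%")
```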

Because of human error, we’ll be on the lookout for anything that was misclassified. If you come across a subject in the transcription phase that you believe belongs on a different track, please tag the subject with a #misclassification tag on the Talk boards.

By Judaica DH at the Penn Libraries.

Exported from Medium on April 14, 2020.

Cite this post: Emily Esten. “Reviewing Sorting Phase Data: Hebrew or Arabic Script?”. Published March 22, 2019. https://judaicadh.github.io//blog/sorting-phase-script/. Accessed on .